APPENDIX A

EXTENDING  FILE  SYSTEMS  TO  DISTRIBUTED  SYSTEMS 

N. J. Livesey

A.1  INTRODUCTION 

This paper examines the so-called 'meta-system' approach to distributed
system construction; that is, constructing a distributed system by utilizing
existing local operating systems. In particular, it looks at some of the file
system problems that may be encountered when such an approach is followed.
These problems are simplified if one has the opportunity to rewrite the
underlying local operating system (see [Oppen 81], for example), but typically
this is not the case. Since it is impractical to look at all existing local
operating systems, I focus on a particular local operating system, Primos,
with its overlying user interface, the Georgia Tech Software Tools Subsystem.
However, these comments do not apply to just this environment; much of what is
said is probably true for many existing local operating systems.

A.2  GENERAL  PROBLEMS 

The problem with most existing local operating systems is precisely that
they predate the decision to distribute. Although a local operating system
should have some autonomy, it should also have been designed with an eye to
integration, if it is to be useful in a distributed system. There seem to be
two classes of problems in extending existing operating systems to new
purposes:

"Deficiencies" in the existing local operating system can make it very
difficult and involved to perform new functions in a reasonable way. There
are very few things that any operating system will make it impossible to do,
but if the system was originally built without them in mind, it can lead to
contortions, and contortions lead to inefficiencies.

"Biases" in the design of an existing system can often lead one to
extend it in certain ways, without fully exploring alternatives which might be
equally valid, and the ease or difficulty of adding features to an existing
structure can close off debates on the best new features to add.

The solutions suggested in this paper are intended to avoid these
problems at minimum cost, rather than to produce radical, but expensive,
solutions.

A.2.1 Naming and Addressing

"Meta" distributed operating systems are produced by introducing a
network operating system on top of previously separate local operating
systems. This network operating system must at least make it possible to
allow a job on one machine to access files on another, and it should
preferably allow a user on one machine to run a job on another without having
to log on to the second machine.

File naming is a problem area in meta-distributed systems because the
naming 'space' of these systems is usually the union of the naming spaces of
their component local operating systems. In order to address resources in the
total system, one needs to introduce mechanisms to allow the user to address
outside the local system on which he may be running.

Ideally, one would like a single name space for the entire system,
rather than connected individual name spaces. Why is this not easy in a
meta-system?

A.2.1.1 File System Naming

We need first to allow cross-machine file access. This is easily
achieved by running a server process in the system which accepts requests from
jobs running on one node to access files on another node. This may be a
central process or a distributed one.

Usually this server is capable of dealing with both local names, which
are interpreted in the name space of the current machine or of a local
directory, and global names, which are interpreted in a space consisting of
all the machines which are currently operating. From the user's point of
view, there are also relative and absolute names. For example:

•  The unadorned pathname:

macros

might be relative to my current directory (set by the 'cd',
change directory, command), and returns a file called 'macros',
if it exists, in the directory to which I am attached.

•  The absolute pathname:

/uc/jon/macros

returns 'macros' which is on node C, if node C is accessible,
irrespective of which machine I am currently working on. The
pathname element /uc simply indicates machine C in the network.

•  The relative pathname:

//jon/macros

is a pathname relative to the current system, and returns the
"nearest" file called 'macros'. "Near" is determined by the way
in which the logical disks are ordered by the systems
administrator. Local disks are always "nearer" than remote
disks.
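
The three pathname forms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DISKS` table, the `resolve` function, and its parameters are hypothetical names invented here, and each logical disk is modelled as a plain set of file names rather than a real Primos volume.

```python
# Hypothetical model: each logical disk holds a set of file names
# (written relative to the root of that disk).
DISKS = {
    "ub": {"/jon/macros", "/jon/other", "/fred/macros"},
    "uf": {"/jon/macros"},
}

def resolve(path, cwd, disk_order):
    """Return the fully qualified name that 'path' denotes, or None.

    cwd        -- current directory as an absolute name, e.g. "/ub/jon"
    disk_order -- logical disks, nearest first (the administrator's ordering)
    """
    if path.startswith("//"):
        # "Nearest" form: try each disk in order; the first hit wins.
        tail = path[1:]                      # "//jon/macros" -> "/jon/macros"
        for disk in disk_order:
            if tail in DISKS.get(disk, ()):
                return "/" + disk + tail
        return None
    if path.startswith("/"):
        full = path                          # absolute: names its own node
    else:
        full = cwd.rstrip("/") + "/" + path  # unadorned: relative to cwd
    disk, _, rest = full[1:].partition("/")
    return full if "/" + rest in DISKS.get(disk, ()) else None
```

In this sketch an absolute name on a node that is not in the table (an inaccessible node C, say) simply fails to resolve, while the `//` form silently picks whichever disk is listed first and holds the file.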

File systems are only special cases of name-to-address mapping
mechanisms. At each directory level in a file system, you can tell a
directory the name (of a file or directory) in its name space, and it will
respond with an address, leading you to the file, or to a directory whose
address space contains the rest of the pathname. So,

/ub/jon/macros

is interpreted (mapped) first by a top-level directory which strips off '/ub'
and maps you to a directory 'jon' on a particular node, node B, where a
directory strips off 'jon' and maps you to a file 'macros' in its address
space. Graphically, this can be represented as a tree, where each path
through the tree leads to one leaf:

                        root
                    /           \
             ub                      uf
          /       \                   |
    fred             jon             jon
   /     \         /     \         /     \
other  macros   other  macros   other  macros

Here, /ub and /uf point to the roots of the file systems on machines B and F.

A relative pathname starts at any given internal node of the tree
(determined by the last 'cd', change directory, command) and an absolute
pathname starts at 'root'. Since there is only one path from a given internal
node to a given leaf, there is no ambiguity, once we know at which internal
node to start.
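
The directory-by-directory mapping described above can be illustrated with nested dictionaries, one per directory. This is a sketch only: `TREE`, `walk`, and the leaf labels are hypothetical stand-ins for real directories and file addresses.

```python
# Each directory is a dict mapping a name to either another directory
# (a dict) or a leaf. Leaf strings are hypothetical "addresses".
TREE = {
    "ub": {
        "fred": {"other": "B:fred/other", "macros": "B:fred/macros"},
        "jon": {"other": "B:jon/other", "macros": "B:jon/macros"},
    },
    "uf": {
        "jon": {"other": "F:jon/other", "macros": "F:jon/macros"},
    },
}

def walk(start, pathname):
    """Map a pathname to a leaf, one directory level at a time.

    'start' is the directory at which interpretation begins: the root
    for an absolute pathname, the current directory for a relative one.
    """
    node = start
    for name in pathname.strip("/").split("/"):
        if not isinstance(node, dict) or name not in node:
            return None          # no such entry at this level
        node = node[name]        # the directory "strips off" one element
    return node
```

Starting the same walk at a different internal node is all that distinguishes a relative pathname from an absolute one in this model.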

A.2.1.2 Aliasing

In  addition,  I  may  also  have  a  template  or  alias  file  which  will  perform 
transliteration  or  aliasing  of  file  names  allowing  one  file  name  to  masquerade 
as  another,  but  which  will  not  perform  any  interpretation  (i.e.  mapping). 
The  interpretation  is  performed  after  the  aliasing. 


One can imagine a routine expand() which performs filename aliasing
before passing the expanded pathname on to an open() routine which performs
directory searching to find the actual file intended. The alias filename must
appear surrounded by '=' signs, and if the alias file contained the line:

| macros   /ub/jon/progs |

then =macros= would be transliterated into /ub/jon/progs by expand() before
interpretation.

I might decide that =macros= should translate into any one of:

/uc/jon/progs
//jon/progs
progs

and the file I finally get would depend on the interpretation which is
performed by open(), after =macros= is transliterated by expand(). In other
words, it would depend on the current directory if we transliterate into a
relative pathname, and not otherwise.
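
The two-step scheme can be sketched as follows. This is an illustration, not the Software Tools implementation: the function bodies, the regular expression, and the decision to leave unknown aliases untouched are all assumptions made here, and `interpret` stands in for the directory search that open() would perform.

```python
import re

# An alias is a name between '=' signs, e.g. "=macros=".
ALIAS = re.compile(r"=([^=/\s]+)=")

def expand(name, templates):
    """Transliterate each =alias= using the template table.

    Purely textual: no directory is consulted. Unknown aliases are
    left unchanged (an assumption made for this sketch).
    """
    return ALIAS.sub(lambda m: templates.get(m.group(1), m.group(0)), name)

def interpret(path, cwd):
    """Stand-in for the interpretation step performed by open().

    Names beginning with '/' are returned unchanged; anything else is
    taken relative to the current directory.
    """
    if path.startswith("/"):
        return path
    return cwd.rstrip("/") + "/" + path
```

With the template `macros -> progs`, `interpret(expand("=macros=", t), cwd)` yields a different file for each current directory, whereas the template `macros -> /ub/jon/progs` yields the same file everywhere.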


Templates introduce a new mapping which is not a tree and which does not
map from the internal nodes of the file system, set by 'cd', but from the
alias or template file of 'current user', set by 'login'. A template file is
attached to an account, not to a directory, and does not change as a user
works on a given node of the system. It is set as he logs on.

However, a user may have accounts on more than one machine. Since
template files are per-account, and not per-user, they are a local-system
concept which has to be extended in some way in extending the system.

Suppose 'jon' has two template files, one on logical disc ub, and one on
logical disc uf.

/ub/jon/template:

| macros   //jon/macros   |
| other    /ub/jon/other  |

/uf/jon/template:

| macros   //jon/macros   |
| other    /ub/jon/other  |

Then, when 'jon' is logged on to machine B, the following mapping will be
superimposed onto the file system by templates:

                        root
                    /           \
             ub                      uf
          /       \                   |
    fred             jon             jon
   /     \         /     \         /     \
other  macros   other  macros   other  macros

macros  --->  /ub/jon/macros
other   --->  /ub/jon/other

while if 'jon' is logged onto machine F, the mapping changes to:

                        root
                    /           \
             ub                      uf
          /       \                   |
    fred             jon             jon
   /     \         /     \         /     \
other  macros   other  macros   other  macros

macros  --->  /uf/jon/macros
other   --->  /ub/jon/other

The mixing of transliteration (templates) and interpretation
(directories) leads to some unexpected 'features'. For example, if the text
formatter program 'fmt', which reads text source files and formats them for
printing, also allows me to read in a file containing formatter macro
definitions by including a line:

.so filename

in my formatter source file, then I can make up a text formatter source file
'file.fmt' which calls another file 'macros' using the text formatter's '.so'
feature:

file.fmt

| .so =macros=  |
| rest of file  |

Now if I format the file, using the command line:

fmt file.fmt

the formatter will first read file.fmt and on the first line find the
reference to file =macros=. In attempting to open =macros= it will first get
it transliterated according to my templates file, then read the =macros= file
(whichever one is finally opened by expand() followed by open()) and use its
contents to format the rest of the document. Supposing that I am on machine
B, and that my templates file on that machine contains the line:

macros   //jon/macros

then it will be transliterated to //jon/macros, interpreted according to the
machine I am currently on, and finally read in as file /ub/jon/macros.

If I were logged in on machine F, with an identical templates file, and
read in the same formatter source file 'file.fmt', then although expand()
performed the same transliteration, the macros file finally read would be
/uf/jon/macros! This might or might not be an identical file.

In other words, there are two levels of potential confusion:

1.  You can have dissimilar template files in your accounts on
different machines, and then the same template may be
transliterated differently.

2.  Even if the template is transliterated identically on the
machines, the fact that you may be in different directories on
the different machines may lead to the interpretation of the
transliterated template being different.

Clearly, to this, and to many of the subsequent difficulties I shall raise,
there are purely administrative solutions.

I can avoid the first difficulty by having identical templates on all
machines, and undertaking to keep them all identically updated.

I can avoid the second difficulty by making all templates transliterate
into absolute pathnames:

macros   /ub/jon/macros

However, the fact that I have to solve these very obvious problems by
user action suggests that there is some underlying problem which is not being
solved at the operating system level.

Even if I adopt these administrative solutions, there remains a problem.
For example, if I grant READ permission on 'file.fmt' to another user 'fred',
when fred reads 'file.fmt', the formatter will try to transliterate =macros=
according to fred's template file, which may contain no line for =macros=, or
worse, a line:

macros   //fred/some_other_file

in which case fred will format my formatter source file with an unintended
macros file which may be grossly inappropriate, leading to a formatter crash;
or it may be very subtly incorrect, leading to a successful run of the
formatter and the production of an incorrect but plausible document.

Fred ought to have access to my template file when formatting my
documents. In fact, Fred ought to be like me, or rather, be me in this
project. In order for the operation of formatting the document to be carried
out correctly, interpretation and transliteration ought to be carried out
per-project rather than per-user.

The main problem is the mixing of two operations, transliteration and
interpretation, which look similar but are quite different. Transliteration
is always carried out in the context of the account (not the user, since he
may have accounts on several machines), while interpretation is carried out in
the context of the file system subtree identified by the pathname.

In summary, if you tie together several existing file systems by
connecting their roots, then a single user's files will no longer be in one
file system subtree, but in several, one on each machine. This leads to
inconsistent file pathname mapping.

Operations are performed per-account (such as template handling) which
ought rather to be performed per-user.

If this is resolved by administrative measures, such as the use of
absolute pathnames, the user will be forced to be aware of his physical
location in the system, rather than of his logical location.

A.2.2 File System Structures

Given  that  file  system  structures  should  reflect  some  logical 
relationship  between  the  files  that  they  contain,  some  other  questions  arise. 

There does not seem to be any good reason why a user should not be able to
generate and use a file system subdivision (not a subtree) which crosses disc
boundaries, and even overlaps with other subdivisions. Directories should be
able to contain entries pointing to directories or files on other disks.

A.2.3  Distribution 

Now we can take a look at the consequences for a distributed system.
One finds that in a distributed system the concepts of transliteration and
interpretation change in subtle ways. Even in a centralized system, we have
template mappings which change according to login account, as well as file
system mappings which do not change if absolute pathnames are used, or which
do change, if they are relative pathnames. The only reason that relative
pathnames change their meaning is that 'login' implies 'cd'.

In a distributed system the problem is complicated because jobs can run
on one machine, with the file system mappings appropriate to that machine, but
with the template mappings which have been imported from the machine on which
the user is logged in. In effect, we have a 'login' which does not imply a
'cd' (unless we want to do a remote 'login', in which case it is not clear
what happens).

One thing that a distributed system should try to do is to allow a user
on, say, B, to run commands on remote machines, using a syntax such as:

fmt@F file.fmt

In order to achieve this, the command line is sent to the command interpreter
on machine F. Along with the command line is sent the 'current context'
(including the template file) of the user. In fact, what is sent is the
current context of the account of that user on B, and is potentially quite
different from his context on machine F. (He might, for example, have no
account on F or might not be logged on there at present.)

Now we have the potential for a job run on F from B to produce different
effects from that same job run on either B or F directly.

Suppose we have the same source file 'file.fmt' on /ub.

file.fmt

| .so =macros=  |
| rest of file  |

This file will not be the file accessed by the command line:

fmt@F file.fmt

Instead, the command line would have to be changed to:

fmt@F /ub/jon/file.fmt

Now we can send the command line to F, along with the current context, and see
what happens. The remote job executes on F as if executing on behalf of
/ub/jon, as far as transliteration is concerned, since we sent the template
file of /ub/jon along with it. When the formatter on F reads /ub/jon/file.fmt
it sees:

file.fmt

| .so =macros=  |
| rest of file  |

transliterates =macros= into //jon/macros, for example, and now interprets
//jon/macros in the context of the machine it is running on, attempting (or
succeeding) to read a file /uf/jon/macros. In effect the job has run partly
(transliteration) as if belonging to /ub/jon, and partly as if belonging to
/uf/jon. This raises some interesting questions about location information,
and how much of it the user has to be aware of.
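
The split behaviour of a remote job can be reproduced in a small sketch. Everything named here (`Context`, `LOCAL_DISK`, `open_on`) is hypothetical, and the sketch assumes that the "nearest" disk is always the local disk of the executing machine and that the file exists there.

```python
from dataclasses import dataclass, field

# Hypothetical table: which logical disk is local to each machine.
LOCAL_DISK = {"B": "ub", "F": "uf"}

@dataclass
class Context:
    """The account context shipped along with a remote command line."""
    account: str
    templates: dict = field(default_factory=dict)

def open_on(machine, name, ctx):
    """Name a job running on 'machine' ends up opening.

    Transliteration uses the templates shipped in 'ctx' (the sender's
    account); interpretation of '//' names uses the executing machine.
    """
    if name.startswith("=") and name.endswith("="):
        name = ctx.templates[name[1:-1]]              # sender's context
    if name.startswith("//"):
        name = "/" + LOCAL_DISK[machine] + name[1:]   # executor's context
    return name
```

Running `open_on("F", "=macros=", ctx)` with the context of /ub/jon gives a name under /uf, which is exactly the mixture of contexts described above; an absolute name such as /ub/jon/file.fmt passes through unchanged.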

And it is not always an answer to suggest that interpretation should
also be carried out in the account context sent with the command line to the
remote machine, since some users will want their remote command lines to
access within the context of the machine that they are on. In particular,
someone who wants to use the remote command line execution feature as 'remote
login' wants precisely that, to transliterate according to the machine he is
on, and access files according to the remote machine he is remotely logged
into. Or does he? Some of these problems can be avoided by simply using
absolute pathnames at all times, but then you solve the immediate problem at
the cost of giving up the entire template and file pathname interpretation
system.

As usual, you can kludge these problems away by totally rigidifying the
system. For example, adopting the rules:

•  all templates map into absolute pathnames.

•  all remote command lines contain absolute pathnames.

•  all remotely executed source files call absolute pathnames.

But maybe this is not what is wanted in a distributed system. Maybe I
want to get "the nearest file called 'macros', or a further away one if the
nearest node is down". This would amount to allowing the transliteration of a
given template to vary from time to time during the execution of a job. This
would require the file system to behave much more like a database, recording,
for example, if two files called 'macros' are identical copies, or separate
objects. And I would like to have at least some facilities equivalent to
relative pathnames and templates.
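
The "nearest file, or a further one if the nearest node is down" behaviour might look like the sketch below. The names `HOLDINGS` and `nearest_available`, and the idea of passing in a set of disks that are currently up, are assumptions made for illustration; the sketch says nothing about whether the copies found are identical, which is the database-like question raised above.

```python
# Hypothetical record of which logical disks hold which files.
HOLDINGS = {
    "ub": {"/jon/macros"},
    "uf": {"/jon/macros"},
}

def nearest_available(tail, disk_order, up):
    """Return the nearest copy of 'tail' on a disk that is currently up.

    disk_order -- logical disks, nearest first
    up         -- set of disks whose nodes are reachable right now
    """
    for disk in disk_order:
        if disk in up and tail in HOLDINGS.get(disk, set()):
            return "/" + disk + tail
    return None   # no reachable copy
```

Because the result depends on `up`, two calls made during the same job can return different files, which is why the text asks for more than a simple tree.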

A.3 SOLUTIONS

In this section we attempt to present solutions to some of these
problems in a fairly general way. In some cases these solutions call for
simple modifications to the local operating system file system.

A.3.1 A Domain Structured File System

In order to avoid some of the difficulties listed in file system
addressing, I am suggesting that we use a file system which is basically
domain (or capability) based.

The underlying structure of the file system will be unchanged, and will
be tree-structured, but users will be able to set up 'domains' at any point in
the file system to which they have access, which will allow them to address
any files they want in a 'one step' fashion. The files to which a user has
access will be determined by owned capabilities rather than by access list.
We will change the routines expand() and open() to implement the new file
system structure on top of the old.

A.3.2 Description and Implementation

It seems that all the tools that you need in order to construct a
domain-based file system already exist in most local operating systems, since
my proposal basically uses only an extension of the 'template' mechanism.

A domain structured file system allows a user to set up a context for
himself in the file system at any point. In order to do this he uses local
template files. At present, template files are per user (actually per
account, since there is an individual template file on each machine for each
account) and the merit of templates is that they allow very simple 'one-step'
file name transliteration.

Typically, one now keeps all the files for a particular program in a
single directory or subtree, except for central files used by several
projects, and once you have done a 'cd' to that project directory, you can
then refer to those files using very short relative pathnames, typically only
one element long: the name of the file. This only works for files in that
directory; hence templates, which allow one-step file naming for files in
other directories.

One can achieve the same effect in a more consistent way by abandoning
direct use of pathnames altogether, and relying totally on indirect use of
pathnames through templates. In effect we redefine a directory to be the
template file at some position in the file system.


Consider a user 'jon' who has several projects under development
concurrently. Each project has a program source file; prog1.r, prog2.r, etc.,
and a documentation file doc1.fmt, doc2.fmt, etc. In addition each program
calls an 'include' file, def.i, which contains some common program definitions
used by both programs. When run, each program will read data files data1,
data2, etc.


                                root
                          /               \
                 ub                              uf
                  |                               |
                 jon                             jon
  /          |         |         \           /         \
docs       proj1     include   proj2       proj3     data
doc1.fmt   prog1.r   def.i     prog2.r     prog3.r   data1
doc2.fmt                       prog2.b               data2
doc3.fmt                                             data3
doc4.fmt                                             test1
                                                     test2
                                                     test3

Now we can build a domain (local template file) for project1 which is
suitable for program development and testing.

testing1:

prog.r  -  /ub/jon/proj1/prog1.r
data    -  /uf/jon/data/test1
obj     -  /ub/jon/proj1/prog1.r
prog    -  /ub/jon/proj1/prog1
def.i   -  /ub/jon/include/def.i


We can build a similar domain for project 3:

testing3:

prog.r  -  /uf/jon/proj3/prog3.r
data    -  /uf/jon/data/test3
obj     -  /uf/jon/proj3/prog3.r
prog    -  /uf/jon/proj3/prog3
def.i   -  /uf/jon/include/def.i


It may be worth noting a few points here:

•  All pathnames in the domain file are absolute; they all have
exactly the same effect no matter where on the network they are
evaluated.

•  In use, the domain files allow you to address each file 'in
scope' by a short, one element name. For example, when project1
is in effect (for SWT, when testing1 is the current template
file) editing 'prog.r' edits the file /ub/jon/proj1/prog1.r.
Naturally, you can choose any names for files that suit you. A
one-step operation of changing the domain in effect to testing3
changes 'prog.r' to /uf/jon/proj3/prog3.r.

•  Some files appear in more than one domain file. These are
shared files; for example 'def.i'.

•  The domains are similar for similar projects. This suggests
that standard domains might be parameterized.

•  We can produce other domains which use the same files in
different ways, for example, a domain suitable for documenting
project1:

document1:

prog    -  /ub/jon/proj1/prog1.r
doc     -  /ub/jon/doc/doc1.fmt
defs    -  /ub/jon/include/def.i
defdoc  -  /ub/jon/doc/def.fmt

•  A file can have different names in different domain files. When
documenting, we are no longer concerned with binary and object
files, so there is no longer a need to make them visible in the
domain. We could have another domain for testing:

run1:

prog  -  /ub/jon/proj1/prog1
data  -  /uf/jon/data/data1

•  More than one file can appear with the same name in different
domain files; 'data' was a test file in testing1, but a regular
data file in run1.

•  Once again, at run time we are not concerned with file prog.r.
Once a program has been compiled we no longer care about its
source and object files.
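
The domain files above are simple name-to-pathname tables, so their properties can be checked mechanically. The sketch below copies three of the domains into Python dictionaries; the helper names `lookup` and `shared_files` are hypothetical and introduced only for this illustration.

```python
# Three of the domain files from the text, as name -> absolute pathname.
DOMAINS = {
    "testing1": {
        "prog.r": "/ub/jon/proj1/prog1.r",
        "data": "/uf/jon/data/test1",
        "obj": "/ub/jon/proj1/prog1.r",
        "prog": "/ub/jon/proj1/prog1",
        "def.i": "/ub/jon/include/def.i",
    },
    "document1": {
        "prog": "/ub/jon/proj1/prog1.r",
        "doc": "/ub/jon/doc/doc1.fmt",
        "defs": "/ub/jon/include/def.i",
        "defdoc": "/ub/jon/doc/def.fmt",
    },
    "run1": {
        "prog": "/ub/jon/proj1/prog1",
        "data": "/uf/jon/data/data1",
    },
}

def lookup(domain, name):
    """One-step naming: the file a short name denotes in a domain, or None."""
    return DOMAINS[domain].get(name)

def shared_files(domains):
    """Pathnames that appear in more than one domain file."""
    seen = {}
    for dname, entries in domains.items():
        for path in set(entries.values()):
            seen.setdefault(path, set()).add(dname)
    return {path for path, names in seen.items() if len(names) > 1}
```

Here `lookup("testing1", "data")` and `lookup("run1", "data")` name different files, while `shared_files(DOMAINS)` includes the common definitions file, matching the points made in the list above.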

A.3.3 Using

Templates are taken from whichever domain file is in effect at a given
moment, so we need a command which will transfer us from the current domain
file to a new one. If we are currently testing project1, and we want to
switch to documenting project3, we might type:

cd document3

The effect of the cd command is to make document3 the current domain file, to
make all the files listed in testing1 inaccessible, and to make the files
listed in document3 accessible.

We might choose to have testing1 disappear, or to have it pushed onto a domain
stack, from which it can be recovered by a 'pod' (pop domain) command. Maybe
we should also have an explicit 'pud' (push domain) command.
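
The three commands can be modelled with a current domain and a stack. This is a minimal sketch under assumptions: the class `DomainSession` and its method names are hypothetical, domains are represented only by their names, and the sketch ignores any restriction on which domains may be entered.

```python
class DomainSession:
    """Tracks the domain in effect and a stack of suspended domains."""

    def __init__(self, current):
        self.current = current
        self.stack = []

    def cd(self, new_domain):
        """Replace the current domain; the old one simply disappears."""
        self.current = new_domain

    def pud(self, new_domain):
        """Push domain: suspend the current domain, then enter the new one."""
        self.stack.append(self.current)
        self.current = new_domain

    def pod(self):
        """Pop domain: discard the current domain and resume the last pushed."""
        self.current = self.stack.pop()
```

For example, starting in testing1, `pud("document3")` followed by `pod()` returns to testing1, whereas `cd("document3")` leaves nothing to return to.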

Of course, the new domain file used by 'cd' must appear in the current
domain file, so we shall have to modify testing1 and document3 so that we can
execute the 'cd' command:

testing1:

prog.r  -  /ub/jon/proj1/prog1.r
data    -  /uf/jon/data/test1
obj     -  /ub/jon/proj1/prog1.r
prog    -  /ub/jon/proj1/prog1
def.i   -  /ub/jon/include/def.i
next    -  /ub/jon/domains/document3

document3:

prog    -  /ub/jon/proj3/prog3.r
doc     -  /ub/jon/doc/doc3.fmt
defs    -  /ub/jon/include/def.i
defdoc  -  /ub/jon/doc/def.fmt
next    -  /ub/jon/domains/testing1

Now, when testing1 is in effect:

cd next

switches us to document3, at which point the effect of:

cd next

is to take us back to testing1. If we want to do something more than switch
back and forth between these two domains, one of them will have to contain the
name of some other domain file:

testing1:

prog.r  -  /ub/jon/proj1/prog1.r
data    -  /uf/jon/data/test1
obj     -  /ub/jon/proj1/prog1.r
prog    -  /ub/jon/proj1/prog1
def.i   -  /ub/jon/include/def.i
next    -  /ub/jon/domains/document3
other   -  /ub/jon/domains/testing2

'Other' is the name of another domain file we might wish to use.


testing1:

prog.r  -  /ub/jon/proj1/prog1.r
data    -  /uf/jon/data/test1
obj     -  /ub/jon/proj1/prog1.r
prog    -  /ub/jon/proj1/prog1
def.i   -  /ub/jon/include/def.i
next    -  /ub/jon/domains/document3
other   -  /ub/jon/domains/testing2
out     -  /ub/jon/domains/master_domain

'Out' is the name of some top-level domain file containing (perhaps) the names
of all the domain files we would consider there:


master_domain:

testing1   -  /ub/jon/domains/testing1
testing2   -  /ub/jon/domains/testing2
testing3   -  /ub/jon/domains/testing3
document1  -  /ub/jon/domains/document1
document2  -  /ub/jon/domains/document2
document3  -  /ub/jon/domains/document3
run1       -  /ub/jon/domains/run1
run2       -  /ub/jon/domains/run2
run3       -  /ub/jon/domains/run3

We might say that 'master_domain' is the 'root' of the domains.


We can structure our work as we structure our file system, by
restricting the domains we can reach from the current domain. It may be
highly appropriate to have document1 as the only domain you can get to on exit
from testing1. This would enforce good habits of work.
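
The restriction can be pictured as a graph whose edges are the domain entries in each domain file. The sketch below uses the 'next', 'other' and 'out' entries shown earlier; the `LINKS` table and the `cd` function are hypothetical, and raising `PermissionError` for an unlisted name is a choice made for this example.

```python
# Domain entries taken from the testing1 and document3 files in the text:
# entry name -> name of the domain it leads to.
LINKS = {
    "testing1": {
        "next": "document3",
        "other": "testing2",
        "out": "master_domain",
    },
    "document3": {
        "next": "testing1",
    },
}

def cd(current, name):
    """Return the domain reached by 'cd name' from the current domain.

    A domain can be entered only if the current domain file lists it.
    """
    target = LINKS.get(current, {}).get(name)
    if target is None:
        raise PermissionError(f"{name!r} is not listed in domain {current!r}")
    return target
```

From testing1, `cd(..., "next")` reaches document3 and a second `cd(..., "next")` returns to testing1; from document3 there is no 'out' entry, so the master domain cannot be reached directly.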


A.3.4 Rules

File system domains should have the same rules as capability
systems [Cook 79].

Every file listed in the current domain is accessible.

No  file  not  listed,  including  other  domain  files,  is  accessible. 

The  domain  file,  not  the  directory,  gives  the  access  rights  to  a  file. 
Access  rights  should  be  listed  in  the  domain  file.  For  example: 


testing1:

prog.r  -  /ub/jon/proj1/prog1.r
data    -  /uf/jon/data/test1
obj     -  /ub/jon/proj1/prog1.
prog    -  /ub/jon/proj1/prog1
def.i   -  /ub/jon/include/def.i
next    -  /ub/jon/domains/document3
other   -  /ub/jon/domains/testing2
out     -  /ub/jon/domains/mstr_domain  ED


gives read/write access to 'prog1.r', whereas:

document3:

prog    -  /ub/jon/proj3/prog3.r
doc     -  /ub/jon/doc/doc3.fmt
defs    -  /ub/jon/include/def.i
defdoc  -  /ub/jon/doc/def.fmt
next    -  /ub/jon/domains/testing1

only gives read access to 'prog3'. (You ought not to be able to modify a
program while you are documenting it.)


The  domain  rights  are: 


read/write  data 
read  data 
execute  data 
enter  domain 
read  domain 
write  domain 


Rights on a domain normally only allow you to enter it (ED). You are not
allowed to read or edit the current or any domain file unless you have RD or
WD access to it, and this must be specified in the current domain file. For
instance:


master_domain:

testing1   -  /ub/jon/domains/test1  ED
testing2   -  /ub/jon/domains/test2  ED
testing3   -  /ub/jon/domains/test3  ED
document1  -  /ub/jon/domains/doc1   ED
document2  -  /ub/jon/domains/doc2   ED
document3  -  /ub/jon/domains/doc3   ED
run1       -  /ub/jon/domains/run1   ED
run2       -  /ub/jon/domains/run2   ED
run3       -  /ub/jon/domains/run3   ED
testing1   -  /ub/jon/domains/test1  WD
testing2   -  /ub/jon/domains/test2  WD
testing3   -  /ub/jon/domains/test3  WD


allows you to enter several domains, and also to modify testing1, 2, and 3.


Otherwise, the domain files should be unwritable to the user. He
should not be able to change his current domain setup informally.
However, once you have modify rights on a domain file, you can change it
with the editor.

Domain enforcement is performed by the file system 'open' routine. This
routine is modified so as always to look in the 'current' domain file and
translate any filename string given it according to that domain file. If the
string supplied to 'open' appears in the domain file, it is located on the
network file system and opened, but only with the access permissions allowed
by the current domain. No file not listed in the current domain is opened,
and an error is returned to the user on failure.
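
The enforcement rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Entry`, `DomainError` and `domain_open`, the right codes "R" and "W", and the rights attached to 'def.i' are all hypothetical, and the function returns the translated pathname rather than actually opening a file.

```python
from dataclasses import dataclass


class DomainError(Exception):
    """Raised when a name is not listed or the requested right is missing."""


@dataclass(frozen=True)
class Entry:
    path: str          # absolute pathname on the network file system
    rights: frozenset  # right codes; "R" and "W" are illustrative only


def domain_open(domain, name, mode):
    """Translate `name` through the current domain and check `mode`.

    `domain` maps local names to Entry objects; `mode` is "r" or "w".
    Returns the absolute pathname a real 'open' would then use.
    """
    entry = domain.get(name)
    if entry is None:
        raise DomainError(f"{name!r} is not listed in the current domain")
    needed = {"r": "R", "w": "W"}[mode]
    if needed not in entry.rights:
        raise DomainError(f"{name!r} may not be opened for {mode!r}")
    return entry.path


# A cut-down stand-in for the testing1 domain (rights are assumed).
testing1 = {
    "prog.r": Entry("/ub/jon/proj1/prog1.r", frozenset({"R", "W"})),
    "def.i": Entry("/ub/jon/include/def.i", frozenset({"R"})),
}
```

With this table, `domain_open(testing1, "prog.r", "w")` yields the absolute pathname, while a write to 'def.i' or any reference to an unlisted name raises `DomainError`.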

This  is  a  more  extensive  change  than  simply  to  use  templates  and 
absolute  pathnames,  which  can  be  done  with  the  present  system. 

The 'current' domain is that in /ub/jon/domains/current.

The effect of the:

    cd new_domain

command is to perform a:

    cp /ub/jon/domains/new_domain /ub/jon/domains/current

The effect of 'push':

    push_domain new_domain

is first to stack /ub/jon/domains/current, and then copy
/ub/jon/domains/new_domain to /ub/jon/domains/current.

The effect of 'pop' is to pop the domain stack into
/ub/jon/domains/current, losing the current contents of
/ub/jon/domains/current.
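
A minimal sketch of the three domain commands, assuming the domain files live in one directory and that the stack simply holds saved copies of the 'current' file. The class name `DomainShell` and its method names are hypothetical, and no permission checking is attempted here.

```python
import shutil
from pathlib import Path


class DomainShell:
    """Models cd, push and pop as copies into the 'current' domain file."""

    def __init__(self, domains_dir):
        self.dir = Path(domains_dir)
        self.current = self.dir / "current"
        self.stack = []  # saved contents of 'current', most recent last

    def cd(self, name):
        # cp <domains>/<name> <domains>/current
        shutil.copyfile(self.dir / name, self.current)

    def push(self, name):
        # Stack the current domain, then change to the new one.
        self.stack.append(self.current.read_text())
        self.cd(name)

    def pop(self):
        # Overwrite 'current' with the most recently stacked domain.
        self.current.write_text(self.stack.pop())
```

Popping an empty stack raises `IndexError` in this sketch; a real command would presumably report an error to the user instead.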

Domains  can  be  created  when  the  current  domain  has  CD  (create  domain) 
permissions.  A  domain  file  entry  (capability)  can  be  created  for  any  object 
the  user  owns. 

A  user  can  share  objects,  including  data  files  and  domain  files,  with 
other  users. 

A  user's  domain  file  can  be  sent  to  another  user,  with  or  without 
changes  (restrictions)  on  the  permissions  on  its  lines. 

When new objects are created, a side-effect is the appearance of a new
line for the new object in the current domain file.


This new line can be moved to another domain file if that file is accessible
from the current domain (directly or indirectly).

Each  domain  file  for  a  user  appears  in  a  given  directory,  say  'domains'. 

Then the effect of a change domain command is to copy the named domain
file into /swt/vars/user/.templates.

The routine 'expand' already looks at /swt/vars/user/.template when it
encounters a pathname containing =string=. Expand then has to be modified to
assume that all pathnames which are given to it need to be expanded. Expand
will look in the current templates file (the current domain) and perform
transliterations according to the contents of the file. Strings which do not
appear in the current domain will not be transliterated, and an error will be
returned.

A.3.5  Command Files

So  far  we  have  only  mentioned  data  files,  while  in  fact  we  could  also 
deal  with  commands  in  the  same  way. 

In most systems, each user has a (static) 'search_rule' which records
the order in which the command interpreter should search certain directories
for the commands that he runs. When the user invokes 'cmd', the command
interpreter will run the command of that name which is found earliest while
searching the directories in 'search_rule'.

In a domain-based file system, each domain file can clearly contain a
new 'search_rule' and that means in turn that while using another user's
domain file, you also inherit his 'search_rule'. To a very large extent, this
means that you become that user, since you not only process his files, but you
are constrained to process them in the same way as he, and using the same
command libraries.
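
The effect of inheriting a search rule can be shown with a small sketch. The function `resolve_command`, the in-memory `directories` table, and the directory '/ub/jon/bin' are assumptions made for illustration; a real command interpreter would consult the file system.

```python
def resolve_command(cmd, search_rule, directories):
    """Return the pathname of the first `cmd` found along `search_rule`.

    `search_rule` is an ordered list of directory names; `directories`
    maps each directory name to the set of command names it holds.
    Returns None when no directory on the rule contains the command.
    """
    for directory in search_rule:
        if cmd in directories.get(directory, ()):
            return f"{directory}/{cmd}"
    return None


# Hypothetical contents: a private 'fmt' shadows the system one.
directories = {
    "/ub/jon/bin": {"fmt"},
    "=bin=": {"fmt", "os", "sp"},
}
own_rule = ["=bin="]
jons_rule = ["/ub/jon/bin", "=bin="]
```

Under `own_rule` the name 'fmt' resolves to '=bin=/fmt'; under `jons_rule` the same name resolves to '/ub/jon/bin/fmt'. This is the sense in which a user working in another user's domain runs that user's commands.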

A.4  UNSOLVED PROBLEMS

We now consider some problems which cannot be solved without some
changes to the underlying system.

A.4.1  More Binding

We mentioned above that some context problems can be avoided by using
absolute pathnames at all times, but this depends on the absolute pathname
feature being available. Files are somewhat unusual objects in having
pathnames. Most local operating systems do not generalize naming concepts to
all objects. There is no multi-level pathname translation scheme, such as
exists for file pathnames, for library programs or physical devices (although
these facilities do exist in some local operating systems, such as Multics).

A.4.2  Variables 

Some command languages have both local and global command variables,
which raises an interesting question in a distributed environment: where
should they be evaluated? Local variables are defined only in the activation
of the shell program in which they are set, while global variables are
associated with a user rather than a shell program (and they are stored in a
variables file when he logs off). He can also save them at any time using a
command language 'save' command.

This raises the question of the 'scope' of variables. Supposing that a
process on A spawns a remote process to run on B, presumably the variables
file of the user is exported to the remote process. What happens if a
variable is changed?

    echo [vara]
    set@B vara = newvalue
    echo@A [vara]

What happens if a variable is changed, and the remote command line invokes a
'save' function on the variables? Does this cause the new variables to be
saved? And then can the original job on A also do a 'save'?

A.4.3  Command Functions

Command language functions can return a single success code directly in
such a way that it can be evaluated in a command language 'if' statement.
This means that the command line:

    if [function] then S1 else S2;

executes command language statement S1 if [function] returns TRUE and S2
otherwise.

This is going to have odd results unless [function] is executed on the
same machine as the 'if'. If the 'if' assumes different context information
to that assumed by the function (for example, a different interpretation of
relative pathnames) then there will be an inconsistency.

A.4.4  On-Condition

Some languages allow events (conditions) external to a program to be
signalled to it in the form of a software-implemented interrupt. It is
interesting to speculate what would happen if we declared an on-condition on
one machine and raised the condition on another.

A.4.5  Conclusion

Other problems in extending single-machine systems to run on multiple
machines are easily solved by adding extra software over the original
operating system.

Some  problems  are  not  so  easily  solved,  either  because  of  addressing 
deficiencies,  or  because  results  have  to  be  returned  to  an  indeterminate 
location. 
APPENDIX B


INTERPRETING AND COMPILING COMMAND INTERPRETERS


N. J. Livesey

B.1  COMMAND INTERPRETERS

A command interpreter is a program which typically establishes the
interface between the user and the operating system, performing a translation
from the input commands into the form which the operating system understands.
We contend that there are, in fact, two rather different forms of command
interpreter: those which execute commands directly (which we call "compiling
command interpreters") and those which, because of a gross mismatch between
the world as the user sees it and the world as understood by the operating
system, have to intervene at almost every stage of the execution of the user's
commands (which we call "interpreting command interpreters").

This note is a proposal for a compiling command interpreter for use on
an FDPS (Fully Distributed Processing System).

It first examines the differences between interpreting and compiling
command interpreters, especially with respect to their 'computational power'
and the run-time overheads that may be expected from each.

It then argues that it is possible to produce a command interpreter of
the compiling rather than the interpreting kind for an FDPS and supplies a
theoretical framework for such an interface.

It finally suggests a method of implementation for a fully distributed
compiling command interpreter.


B.1.1  Compiling Versus Interpreting

In this section we define the two kinds of command interpreters. An
interpreting command interpreter is a command interpreter which is invoked to
carry out each individual command on a command line. A compiling command
interpreter is a command interpreter which is invoked to scan a command line,
compile it into some intermediate representation, and issue the intermediate
representation to the operating system in one step.

B.1.2  Command Line Execution

In this section, we explain how the two kinds of command
interpreters would handle a simple command line:

    fmt file.fmt | os | sp

This command line is a 'pipeline' (a sequence of operations in which the
result of one operation is immediately "piped" to the next operation) which
calls three commands in series. 'Fmt' is a text formatter, 'os' is an
overstrike and underscore processor, and 'sp' is a line-printer spooler. The
file 'file.fmt' is input to the formatter. The command line is a pipeline in
the sense that:

1.  The commands 'fmt', 'os', and 'sp' are filters, programs which
will take a continuous stream of input, perform some
transformation on it, and produce a stream of output which can be
'piped' to another filter.

2.  The output from one command can be redirected to be the input
of another filter, or to some file or device. In this example
the output from 'fmt' is piped to the input of 'os', and the
final output of 'sp' goes to a line-printer.

An interpreting command interpreter would handle this command line by
first parsing it to find its three pipeline nodes (fmt file.fmt), (os), and
(sp), and would then run the first, the formatter, directing its output to an
intermediate file 'temp1', then, when the formatter terminated, run 'os' with
'temp1' as input and 'temp2' as output, and finally run 'sp' with 'temp2' as
input, and the line-printer as output.

The compiling command interpreter would parse the command line in the
same way, but would then cause the operating system to set up a separate
process for each command line node, with inter-process communication 'pipes'
between successive nodes, and allow the node processes to run concurrently,
communicating through the inter-process communication 'pipes'.
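
The two styles of execution can be contrasted with a small Python sketch. It is an analogy under stated assumptions, not an implementation: the stages are pass-through stand-ins that only record their name each time they handle a line, lists play the part of the intermediate files, and chained generators stand in for concurrent processes joined by pipes (their execution is interleaved rather than truly parallel). The helper names are hypothetical.

```python
def make_stage(name, log):
    """Build a pass-through filter that logs `name` for each line handled."""
    def stage(lines):
        for line in lines:
            log.append(name)
            yield line
    return stage


def run_interpreting(stages, source):
    """Run each stage to completion before the next one starts."""
    data = list(source)
    for stage in stages:
        # The fully built list plays the role of 'temp1', 'temp2', ...
        data = list(stage(data))
    return data


def run_compiling(stages, source):
    """Connect all stages first, then let data flow through them."""
    stream = iter(source)
    for stage in stages:
        stream = stage(stream)  # set up a 'pipe'; nothing has run yet
    return list(stream)
```

For two input lines and stages named fmt, os and sp, the interpreting runner logs fmt, fmt, os, os, sp, sp, whereas the compiling runner logs fmt, os, sp, fmt, os, sp: each line reaches the spooler before the formatter has seen the next one.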

B.1.3  An Aside - Processes and Processes

One reason for the difference in these two modes of operation lies in
operating system structure, in particular the treatment of processes. The
process primitive in a system can be identified with (at least) the user or
with a task. For example, in Unix, a process is identified with a task. It
consists of an identifier, an address space, input/output ports, and a process
state, so that it can be separately scheduled. Unix processes can create
other processes, or can arrange to run a new program in the address space of
the current process, destroying the program presently running there. They can
also communicate with any other process whose identifier they know.
Inter-process communication can be through memory buffers. A user can have
several inter-communicating concurrent processes running on his behalf. In
particular, his command interpreter can create several processes from one
command line to run the elements of a pipeline concurrently.

In a system in which 'process' is identified with the user, a process
represents the entire state of an on-line user, and might, for example, be
tied to his terminal, and contain his command interpreter running in shared
system space. If the user runs a new program, it is loaded and run in the
address space of the 'user' process, replacing (and destroying) the previous
program run there rather than being run in the address space of a new
concurrent process, and there is little idea of inter-process communication,
except through the file system. Successive programs running in the user
process communicate by writing intermediate files, and one program has to run
to completion, close the intermediate file which contains its results, and
exit, before the next program can open that intermediate file and read it as
input.

B.1.4  A Stack Command Interpreter

There is an intermediate form of command interpreter, found in the SOLO
operating system.


In the SOLO operating system the single 'user' process maintains a
'program stack' so that one program running in the process can invoke a second
program, at which point the first program will be suspended, rather than
destroyed, and its state pushed on the program stack. The second program in
turn can get itself stacked by invoking a third program. And so on.


At  some  point,  a  program  terminates,  and  then  the  top  program  state  is 
popped  off  the  stack,  and  that  program  is  resumed.  At  any  time,  the  stack 
represents  the  states  of  several  suspended  programs. 


B.1.5  Computational  Power 

One can show the differing power of the kinds of shell by considering
them as computational machines. Intuitively it seems that:

1.  A single process command interpreter is equivalent to a Turing
machine with a one-way read-write head.

2.  A UNIX (multi-process) command interpreter (called a "shell")
is equivalent to a Turing machine with a two-way read-write
head.

3.  The SOLO command interpreter is equivalent to a Turing machine
with a single pushdown.

Cases 2 and 3 have the same power, but case 1 has lesser power. In
particular, case 1 does not allow pipeline elements to be connected by pipes
in two directions. All single command lines have to be capable of being
serialised. (Although, of course, one is allowed to execute a single command
line several times by 'looping'. This involves re-executing the individual
commands of the command line pipeline.) From now on we forget case 3.

B.1.6  Elapsed Time

There are some obvious differences in elapsed time to execute the same
command line.

1.  Case 1 does not allow the exploitation of concurrency (parallelism)
in the pipeline.

2.  In case 2, where concurrency exists, it can be exploited. In a
single-processor system pipeline elements which do not depend on one
another can run in any order, but concurrency need not lead to
any elapsed time saving. In a multiprocessor system one can
get true physical concurrency.

B.1.7  Overhead 

We can try to evaluate the overhead incurred by the two forms of command
interpreter. The total overhead is made up of command execution time (which
should be the same in any system), shell overhead, and inter-process
communication overhead.

    total_overhead = command_execution + shell_overhead +
                     inter-process_communication_overhead

B.1.7.1  Command Execution

Total command execution time should not differ, whether commands are run
serially or concurrently. Call this time 'C'. This might be affected by
swapping and paging.

B.1.7.2  Command Interpreter Overhead

An interpreting command interpreter will be reinvoked as each element of
the command pipeline terminates (or at least some component of the command
execution code will be reinvoked).

A compiling command interpreter is invoked only when the command line is
parsed, and when the entire line terminates. The overhead for the
interpreting (1) and compiling (2) forms is then

1.  overhead = O(n)

2.  overhead = O(2)

where n is the number of command line elements.

B.1.7.3  Inter-process Communication Overhead

We can try to evaluate the inter-process communication overhead incurred
by the two forms of command interpreter. In either case, inter-process
communication involves transferring each block of intermediate results from one
program to the next, or from one process to another. However, in the case of
successive programs running in the same process, the intermediate results must
be written to a disc file, which must then be closed before the next program
can open and read it.

In the case of concurrently running processes the intermediate results
can be buffered, block by block, through central memory.

From previous results obtained on the MININET project, writing to the
file system costs around 1000 ms per block, while buffering through central
memory costs around 1 ms per block. Then we have:

1.  overhead = O(1000m)

2.  overhead = O(m)

where m is the number of blocks of intermediate results transferred.

B.1.7.4  Total Overhead

Combining the results of the previous two sections gives the total
overhead incurred by the two forms of command interpreter.

1.  total = C + O(n) + O(1000m)

2.  total = C + O(2) + O(m)
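
To get a feel for the magnitudes, the two totals can be evaluated for sample figures. The sketch below treats the order-of-magnitude terms as exact costs, which is a simplification. The per-block costs are the 1000 ms and 1 ms figures quoted above; the parameter `invoke_ms` (cost of one command interpreter invocation) and the sample values of C, n and m are assumptions.

```python
FILE_MS_PER_BLOCK = 1000   # file system transfer cost quoted in the text
MEMORY_MS_PER_BLOCK = 1    # central memory buffering cost quoted in the text


def total_interpreting(c_ms, n, m, invoke_ms=1.0):
    """total = C + O(n) + O(1000m), with assumed unit constants."""
    return c_ms + n * invoke_ms + FILE_MS_PER_BLOCK * m


def total_compiling(c_ms, n, m, invoke_ms=1.0):
    """total = C + O(2) + O(m); n is unused because the interpreter
    is invoked only twice however long the command line is."""
    return c_ms + 2 * invoke_ms + MEMORY_MS_PER_BLOCK * m


# Assumed sample: C = 5000 ms, a three-element pipeline, ten blocks.
interp = total_interpreting(5000, n=3, m=10)   # 5000 + 3 + 10000 = 15003
comp = total_compiling(5000, n=3, m=10)        # 5000 + 2 + 10 = 5012
```

With these assumed figures the intermediate-file traffic dominates the interpreting total, which matches the argument in the text.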

B.1.8 

The difference in elapsed and total time is clear, even for a single
processor system. For a multi-processor system we conclude that a compiling
command interpreter, capable of taking advantage of the parallelism in a
command line, is necessary.

B.1.9  A Distributed Command Interpreter

For purposes of comparison, here is an outline of an interpreting
command interpreter for an FDPS. The basic ideas are:

1.  A command file can be parsed once and for all to produce a
graph whose nodes are individual processes (we take the view
that processes are tasks, not users), and whose directed edges
represent inter-process communication data flow.

2.  The graph can contain choice (i.e. decision and/or selection)
nodes and iteration. (The command file lines are not necessarily
serialisable.)

3.  In order to exploit the parallelism inherent in the process
graph and also present in the particular hardware configuration
on which we are currently running, we must be prepared to
distribute a representation of parts of the process graph to
the processors on which particular processes are to be run.

4.  In order to distribute the information contained in the graph
we treat each node as a single unit and attach to each node
(representing one process) exactly the information needed to
connect that node to its neighbors in the original graph and
to handle any synchronization which may arise. Its neighbors
in the original process graph are those process nodes to which
it was connected by inter-process communication flow edges.

5.  Then we call resource allocation and work distribution in
order to find out where each process is in fact to run.

6.  We then send the connectivity information for each process to
the processor on which it is to run.

7.  Finally, distributed control will use the distributed
connectivity information to set up inter-process communication
between concurrent processes in a given processor and
inter-process communication between concurrent processes running in
separate processors.

The  rest  of  this  document  suggests  ways  of  parsing  a  command  file  into  a 
process  graph,  and  ways  to  distribute  the  connectivity  information  so  that 
Distributed  Control  can  use  it  on  physically  separated  processors. 

The information which is distributed so that local elements of
Distributed Control can run the complete command file consists of IPC tokens.
Two tokens, a send and a receive token, are sent out for each edge in the
original Process Graph.

We have not specified exactly what process graph edges represent, apart
from representing IPC. A single edge might represent a write-read pair, a
communication line, a message buffer being sent, a synchronizing signal, or
the action of one process creating another. All that we require is that an
edge has two end-points which are processes. (We do not even require the two
processes to exist at first. Perhaps one of the processes is being created,
or dying.)

For an edge, we distribute the send token to the process which is
initiating the IPC transfer represented by the edge, and the receive token to
the process which is the object of the IPC transfer.
Distributed Control uses the tokens at run-time to implement distributed
system control.
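
The token distribution rule can be sketched as follows. The `Token` record and its fields, and the function `make_tokens`, are hypothetical; the text only requires that each edge yields one send token for the initiating process and one receive token for the process that is the object of the transfer.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    kind: str  # "send" or "receive"
    edge: str  # name of the process graph edge
    peer: str  # process at the other end of the edge


def make_tokens(edges):
    """Build per-process token lists from a table of edges.

    `edges` maps an edge name to (initiating process, object process).
    Returns a dict mapping each process to its list of tokens.
    """
    tokens = defaultdict(list)
    for edge, (initiator, target) in edges.items():
        tokens[initiator].append(Token("send", edge, target))
        tokens[target].append(Token("receive", edge, initiator))
    return dict(tokens)


# Example edges between three processes (names are illustrative).
edges = {"edge1": ("proc1", "proc2"), "edge2": ("proc2", "proc3")}
tokens = make_tokens(edges)
```

Here 'proc2' ends up holding a receive token for edge1 and a send token for edge2, which is all the connectivity information it needs once it has been placed on a processor.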

B.2  PROCESS GRAPH COMPILATION
B.2.1  Process  Graph  Language 

A command file can be represented as a graph, with processes as nodes
and inter-process communication represented as directed edges. This graph can
be specified in a specialised programming language, the Process Graph
Language.

The translation of a process graph proceeds in seven stages:

1.  Specification of the process graph in process graph description
language. This language is described in the next section.

2.  Parsing of the process graph description language program.
Detection of syntax errors in the process graph description
language program.

3.  Construction of the precedence graph from the process graph
program. The precedence graph is a directed graph whose nodes
represent IPC, and whose edges express the precedence
relations between IPC. This will be explained further below.

4.  Checking of the precedence graph. Detection of semantic
errors in the process graph description language program.

5.  Translation of the precedence graph into IPC tokens. Tokens
are used at run-time to enable the operating system to enforce
the precedence relations between IPC which were laid down in
the process graph.

6.  Distribution of the processes of the process graph, along with
their IPC tokens, according to instructions issued by resource
allocation and work distribution.

7.  Run-time enforcement, by Distributed Control, using the IPC
token lists.

B.2.2  Process Graph Language Grammar

Here is a simplified grammar of the PGL. This is not intended to be a
comprehensive definition of the grammar; merely a summary around which to
build a description of the process graph description language compiler. For
the full grammar see [Livesey 6]. The semantic actions to be taken upon the
satisfaction of each production are given in the right-hand column beside each
production as a section number, referring to the rest of this section.
Production                                                    Section Number

<process graph>  ::= 'begin' <pg head> <pg body> 'end'        B.2.2.1

<pg head>        ::= <prog def>* <process def>* <type asg>*   B.2.2.1
                     <edge def>*

<pg body>        ::= 'begin' <stmt list> 'end'                B.2.2.1

<prog def>       ::= 'prog:' <prog ident> '==' <string>       B.2.2.1

<process def>    ::= 'process:' <process ident> '=='          B.2.2.1
                     <prog ident>

<type asg>       ::= 'trans:' <ident list> |                  B.2.2.2
                     'root:' <ident list> |
                     'perm:' <ident list> |
                     'graph:' <ident list> |
                     'edge:' <ident list> |

<edge def>       ::= <edge ident> ':='                        B.2.2.3
                     <process ident> '-->' <process ident> |
                     <process ident> 'creates' <process ident> |
                     <process ident> 'dies' --> <process ident> |
                     <process ident> 'signals' <process ident> |

<prog ident>     ::= 's1' | ... | 'sn'                        B.2.2.3

<process ident>  ::= 't1' | ... | 'tn'                        B.2.2.3

<edge ident>     ::= 'e1' | ... | 'en'                        B.2.2.3

<string>         ::=                                          B.2.2.3

<stmt list>      ::= <stmt> | <stmt list> <stmt>              B.2.2.4

<stmt>           ::= <concatenation> |                        B.2.2.5
                     <binding> |
                     <concurrent> |
                     <procedure call> |
                     <choice> |
                     <iteration> |
                     <edge ident>

<concatenation>  ::= <stmt> '<' <stmt>                        B.2.2.6

<concurrent>     ::= <stmt> '*' <stmt>                        B.2.2.7

<procedure call> ::= <proc name> '(' <stmt> ')'               B.2.2.8

<choice>         ::= 'if' <expression> 'then' <stmt> |        B.2.2.9
                     'if' <expression> 'then' <stmt>
                     'else' <stmt>

<iteration>      ::= 'while' <expression> 'do' <stmt>         B.2.2.10



We begin by building a very simple process graph from a command file
definition, and showing how it might be handled across several machines:

    begin
        prog: prog1 == "=bin=/fmt",
              prog2 == "=bin=/os",
              prog3 == "=bin=/sp";
        proc: proc1 == prog1,
              proc2 == prog2,
              proc3 == prog3;
        root: proc1;
        trans: proc2, proc3;
        edge: edge1, edge2;
        edge1 := proc1 creates proc2;
        edge2 := proc2 creates proc3;
        edge1 < edge2;
    end;

In this example, we have described the creation of three processes in
series. Process 'proc1', which runs the program "=bin=/fmt", creates process
'proc2', running program "=bin=/os", which in turn creates process 'proc3',
running program "=bin=/sp". The '<' symbol expresses the order of creation.

The exact form of the 'process graph' does not matter, but it might look
something like:

    +---------+
    |  proc1  |
    |  (fmt)  |
    +---------+
         |
         | creates
         +-------->  +---------+
                     |  proc2  |
                     |  (os)   |
                     +---------+
                          |
                          | creates
                          +-------->  +---------+
                                      |  proc3  |
                                      |  (sp)   |
                                      +---------+

Enthusiasts of cryptic systems will protest that all this could be more
succinctly expressed as:

    fmt | os | sp

My only answer is that my lengthy syntax expresses what is actually going on,
and that in any case, it can be collapsed to a cryptic form as needed.


Note  that  as  yet  we  have  only  expressed  process  creation,  and  nothing 
about  the  inter-process  communication  between  them.  That  will  come  later. 

Finally,  we  assume  that  the  three  processes  run  concurrently  on  one 
machine,  or  on  several. 

In the next sections, we explain the grammar step by step, and then
expand the example to be more realistic. First, we explain the actions to be
taken upon parsing. The explanation is top-down, descending to greater
detail.

B.2.2.1  Process Graph

In parsing we build up a symbol table, containing the definitions of the
program files, processes, and IPC edges, and a precedence graph, which defines
the precedence relations between the edges of the process graph.

Upon recognizing:

    <process graph> ::= 'begin' <pg head> <pg body> 'end'

the parsing of the process graph description language program is terminated,
and the precedence graph constructed (see below) is processed to produce IPC
token lists.

Upon recognizing:

    <pg head> ::= <prog def>* <process def>* <type asg>* <edge def>*

the building of the symbol table is terminated, and the parsing of the process
graph body is begun. If a symbol is encountered at this stage which is not
defined in the symbol table, it is treated as undefined.

In our example, the <pg head> is:

    begin
        prog: prog1 == "=bin=/fmt",
              prog2 == "=bin=/os",
              prog3 == "=bin=/sp";
        proc: proc1 == prog1,
              proc2 == prog2,
              proc3 == prog3;
        root: proc1;
        trans: proc2, proc3;
        edge: edge1, edge2;
        edge1 := proc1 creates proc2;
        edge2 := proc2 creates proc3;
    end;

Upon recognizing:

    <pg body> ::= 'begin' <stmt list> 'end'

the parsing of the process graph description language program has terminated,
and the precedence graph has been built.

For the example above, the <pg body> is the single expression:

    edge1 < edge2;

The symbol table contains an entry for each program file (code segment),
process and edge in the process graph description language program. In this
table, file system pathname strings are treated as predefined, program files
are identified by their file system pathname, processes are defined in terms
of their constituent program file, and edges are defined in terms of their
sending and receiving processes.

For the process graph description language construct

    prog: prog1 == "prog name";

which satisfies the grammar rule

    <prog def> ::= 'prog:' <prog ident> '==' <string>

we add a node to a linked list of program file definition nodes.

          |
          v
    +-----------+
    | prog      |
    | ^prog  ---+---> "prog1"
    | ^name  ---+---> "=bin=/fmt"
    | fwd       |
    +-----+-----+
          |
          v

The node contains a pointer to the program file identifier and a pointer to
the string name of the file in which the program file is contained.
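
The node layout described above can be sketched in C. This is only an illustration under stated assumptions: the names prog_def, add_prog_def and find_prog_def are hypothetical and do not come from the compiler described here, and new definitions are prepended to the list for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One program file definition node (field names are assumptions). */
struct prog_def {
    const char *ident;      /* program file identifier, e.g. "prog1" */
    const char *name;       /* file system pathname string */
    struct prog_def *fwd;   /* next definition node in the list */
};

/* Prepend a definition and return the new head of the list.
   If allocation fails, the list is returned unchanged. */
struct prog_def *add_prog_def(struct prog_def *head,
                              const char *ident, const char *name)
{
    struct prog_def *n;

    assert(ident != NULL && name != NULL);
    n = malloc(sizeof *n);
    if (n == NULL)
        return head;
    n->ident = ident;
    n->name = name;
    n->fwd = head;
    return n;
}

/* Look up a definition by identifier; NULL means "undefined". */
struct prog_def *find_prog_def(struct prog_def *head, const char *ident)
{
    for (; head != NULL; head = head->fwd)
        if (strcmp(head->ident, ident) == 0)
            return head;
    return NULL;
}
```

A lookup that returns NULL corresponds to the case in which a symbol is treated as undefined.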


A process definition node is added to a linked list of process
definitions when the rule:

<process def> ::= 'process:' <process ident> '=' <prog ident>

is satisfied. Every process is defined in terms of its code part, that is,
its program file:


     |
     v
+---------+
| process |                  (process)
+---------+
| ^proc   |---> "proc1"
+---------+
| ^prog   |---+
+---------+   |
     |        v
     v   +---------+
         |  prog   |         (program file)
         +---------+
         | ^prog   |---> "prog2"
         +---------+
         | ^name   |---> "=bin=/fmt"
         +---------+
              |
              v


B.2.2.2  Type  Assignment 

When the grammar rule:

<type asg> ::= 'trans:' <ident list> |
               'root:'  <ident list> |
               'perm:'  <ident list> |
               'graph:' <ident list> |
               'edge:'  <ident list>

is recognized, process type is assigned. There are two cases: processes which
already existed when the process graph was created, and processes which are
created during the processing of the process graph. Pre-existing processes
are of two kinds: permanent processes, which exist outside the current process
graph, and the root process of the process graph, which is created by the
process graph supervisor (a component of Distributed Control) in the course of
process graph execution. These are declared:

perm: process1, process2;
root: process3;

Typical examples of pre-existing processes are resource processes, which
are permanently in existence, and can be included in any process graph
authorized to access them. Pre-existing processes do not need to appear as
the receiver of a create edge. All transient processes appear as the receiver
of some create edge.


Created (or 'transient') processes come into existence during the execution
of the current process graph. They are declared as:
trans: process4;

Every created process must appear as the receiver of some create edge in
the current process graph before it can be used as the sender or receiver
process in a non-create edge. The program file pointed to by the 'prog' field
of the definition node of a created process is used at run-time to create that
process:


     |
     v
+---------+
| process |
+---------+
| ^proc   |---> "proc2"
+---------+
| ^prog   |---+
+---------+   |
              v
         +---------+
         |  prog   |
         +---------+
         | ^prog   |---> "prog2"
         +---------+
         | ^name   |---> "=bin=/os"
         +---------+

Until the edge which has the creation of this process as a side-effect appears
in a precedence expression, the process is marked as 'not-yet-created' in the
symbol table. If it is used in another edge before it is created, a
compile-time diagnostic is issued. After the creation of process 'proc2' from
program file prog2, 'proc2' can be referred to in process graph description
language statements like any other process.

A node in a process graph may be another graph, rather than a process,
and a message or signal may be sent to a graph in the same way as to a process.

B.2.2.3  Edge 

There  are  also  two  cases  for  edge.  In  the  first  case,  the  edge 
represents  a  send  to  an  existing  process,  and  in  the  second,  a  send  which 
creates  a  process  as  a  side  effect.  The  first  is  called  a  send-append,  and 
the  second,  a  send-create. 

A  send-append  is  the  transmission  of  IPC  from  one  already  existing 
process  to  another.  The  node  which  defines  the  edge  therefore  points  to  two 
process  definition  nodes,  which  must  both  refer  to  existing  processes;  either 
pre-existing  processes,  or  created  processes  which  have  already  been  created 
in some edge in the current process graph program. The grammar rule to be
satisfied is:

<edge def> ::= <edge ident> ':='
               <process ident> '-->' <process ident>

The  edge  e4  is  from  process2  to  process3,  both  pre-existing  processes, 
and  so  the  format  of  this  edge  node  will  be: 


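+---------+
|  edge   |---> "edge4"
+---------+       +---------+
|  proc   |-----> | process |
+---------+       +---------+
|  proc   |--+    |  proc   |---> "process2"
+---------+  |    +---------+
             |
             |    +---------+
             +--> | process |
                  +---------+
                  |  proc   |---> "process3"
                  +---------+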


The grammar rule to be satisfied for send-create is:

<edge def> ::= <edge ident> ':='
               <process ident> 'creates' <process ident>

In a send-create, the receiving process is created as a side-effect of
the message transmission. The edge node refers to a sending process definition
node which must refer to an existing process, and to a receiving process
definition node which refers to a created process which is marked as
not-yet-created. At run-time, the program file referred to by the receiving
process's process definition node will be used to create the receiving process.
In the example below, program file "prog2" will be used to create process
"proc2".


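     |
     v
+---------+
|  edge   |
+---------+       +---------+
|  proc1  |-----> | process |
+---------+       +---------+
|  proc2  |--+    |  proc   |
+---------+  |    +---------+
     |       |    | ^prog   |
     v       |    +---------+
             |
             |    +---------+
             +--> | process |
                  +---------+
                  |  proc   |
                  +---------+
                  | ^prog   |---+
                  +---------+   |
                                v
                           +---------+
                           |  prog   |
                           +---------+
                           | ^prog   |---> "prog2"
                           +---------+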

During parsing, a send-create edge is marked in the symbol table as a 'create'
edge, and later, when the edge appears in a precedence expression, the
receiving process which it creates is marked 'created'. If a process not yet
marked 'created' is used in an edge in a precedence expression, a diagnostic is
issued.
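
The marking rule above can be sketched in C as follows. This is a minimal illustration, not the compiler's actual code: the type and function names (process_def, edge_def, declare_process, use_edge) are hypothetical, only send-append and send-create edges are modelled, and rejecting a second creation of the same process is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum proc_kind { PROC_PERM, PROC_ROOT, PROC_TRANS };
enum edge_kind { EDGE_APPEND, EDGE_CREATE };

struct process_def {
    const char *ident;
    enum proc_kind kind;
    bool created;           /* false means 'not-yet-created' */
};

struct edge_def {
    const char *ident;
    enum edge_kind kind;
    struct process_def *sender;
    struct process_def *receiver;
};

/* Pre-existing (perm, root) processes start out as created;
   transient processes start out as not-yet-created. */
void declare_process(struct process_def *p, const char *ident,
                     enum proc_kind kind)
{
    p->ident = ident;
    p->kind = kind;
    p->created = (kind != PROC_TRANS);
}

/* Called when an edge appears in a precedence expression.
   Returns true if the edge is legal, false if a diagnostic is due. */
bool use_edge(const struct edge_def *e)
{
    assert(e != NULL && e->sender != NULL && e->receiver != NULL);
    if (!e->sender->created)
        return false;                   /* sender does not exist yet */
    if (e->kind == EDGE_CREATE) {
        if (e->receiver->kind != PROC_TRANS || e->receiver->created)
            return false;               /* assumed: create only once */
        e->receiver->created = true;    /* mark receiver 'created' */
        return true;
    }
    return e->receiver->created;        /* send-append needs both */
}
```

With the declarations of the earlier example, using edge2 before edge1 is rejected because proc2 has not yet been created.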


The grammar rule for send-die is:

<edge def> ::= <edge ident> ':='
               <process ident> 'dies' '-->' <process ident>

This defines an edge which has the death of the sending process as a
side-effect. A message is sent from the first process to the second, and, if
successful, the first process dies.


The grammar rule:

<edge def> ::= <edge ident> ':='
               <process ident> 'signals' <process ident>

defines an edge which consists solely of synchronization information passing
from one process to another in the current process graph, or from a process in
this process graph to one in another. No message is transmitted.

For simplicity, we have given the definition of program file identifiers
here as:

<prog ident> ::= 'prog1' | ... | 'progn'

The process graph description language compiler can, in fact, recognize
identifiers of arbitrary length, composed of alphanumeric characters and
beginning with a letter. However, in the rest of this section we shall
continue to use the suggestive subset {'prog1', ..., 'progn'}. As process
identifiers, we have used "process1", "process2", etc. As edge identifiers, we
have used "e1", "e2", etc.

B.2.2.4  Precedence  Graph 

Here we describe the construction of the precedence graph which is built
in the course of compilation. It is analogous to an evaluation tree in
conventional compilation. The precedence graph is a directed graph in which one
or more nodes are present for each construct of the process graph description
language, and the directed edges between the nodes express the run-time
precedence between the nodes. If node1 and node2 are connected by a directed
edge:

node1 --> node2

then node1 precedes node2. The precedence graph may contain loops,
corresponding to iteration constructs in the process graph description language.

B.2.2.5  Statement  List 

A statement list is a sequence of statements S1, S2, ... A statement
list is represented by a series of connected nodes.

A statement is one of the following four types.

B.2.2.6 Concatenation

The construct:

S1 < S2

which is represented by the grammar rule

<concatenation> ::= <stmt> '<' <stmt>

leads to the linking of the first node for S2 to the final node of S1. The
statements S1, S2 can be single edges, or process graph description language
statements.

     |
     v
+---------+
|  edge   |
+---------+
|   S1    |
+---------+
     |
     v
+---------+
|  edge   |
+---------+
|   S2    |
+---------+

B.2.2.7  Concurrency 

Each construct indicating concurrent statements, generated by the satisfaction
of the grammar rule

<concurrent> ::= <stmt> '^' <stmt>

is represented by four node groups: a concurrency-begin (^) node, the two
concurrent statements, and a concurrency-end (v) node. For example: S1 ^ S2

[Figure: precedence-graph representation of S1 ^ S2, with a concurrency-begin
(^) node leading to S1 and S2 and a concurrency-end (v) node joining them.]

The nodes marked S1, S2 represent arbitrary process graph description
language statements. The only restriction is that they should be subgraphs
with a single entry and a single exit.

B.2.2.8  Procedure  Call 

A procedure call is the principal mechanism for communication between
processes and process graphs. The grammar rule for procedure call is:

<procedure call> ::= <proc name> '(' <stmt> ')'

The  representation  of  a  procedure  call  in  the  precedence  graph  is: 


[Figure: a 'call' node followed by the edge node of the next statement, S2.]


The  "proc"  field  in  the  node  points  to  the  symbol  table  entry  for  the 
procedure  definition,  and  the  "pars"  field  is  a  pointer  to  the  subgraph 
representing  the  parameter  list  for  the  procedure  call.  Most  process  graph 
procedures  evaluate  logical  values  derived  from  edge  execution,  from  system  or 
hardware  error  checks,  or  passed  down  from  processes. 

B.2.2.9  Choice 

The choice construct is represented by four node groups. Again, S1, S2
represent arbitrary process graph description language statements. The choice
grammar rule is:

<choice> ::= 'if' <expression> 'then' <stmt> |
             'if' <expression> 'then' <stmt> 'else' <stmt>


The <expression> in the statement can itself be a process graph statement
of the type: single edge, process graph procedure call, or bound expression.

B.2.2.10 Iteration

The iteration grammar rule is:

<iteration> ::= 'while' <expression> 'do' <stmt> 'end'

Iteration  is  a  special  case  of  choice.  The  iteration  construct  is  represented 
as  follows: 


Condition  and  S2  represent  arbitrary  process  graph  description  language 
statements. 


B.2.3 Translation from Task Graph to Precedence Graph

In the task graph description language there are specification
statements, assignment statements, and expressions. Specification statements
have been dealt with above. Assignment statements are used in building the
individual nodes of the precedence graph. The precedence relationships
between edges are specified using expressions built with operators such as '<'.
Expressions may be nested using '(' and ')'. As these edge precedence
expressions are parsed, a 'precedence graph' is built which is analogous to the
expression tree built during the parsing of arithmetic expressions in an
algorithmic language such as Algol or Pascal.

•  Edge - <exp> ::= <edge>

Each time that an edge is recognized, a node is created to
represent it and its address is saved on both the root and leaf
stacks. It will become clear why two stacks are used:

case <edge>:
    /* create_node() returns the new node's address */
    node = create_node (edge_node);
    push_root (node);
    push_leaf (node);
    break;

We shall describe algorithms using their 'C' language
representation (see, for instance, [KERN78]). In this language,
procedure invocation has the syntax:

procedure (param1, param2, ...);

while the selection of an element from a data structure has the
syntax:

structure -> element

•  Precedence - <exp> ::= <exp1> '<' <exp2>

Each time that a precedence construct is recognized, we are
required to set up connecting edges (representing time
precedence) between two subgraphs in the precedence graph, one
representing <exp1>, and the other representing <exp2>. As
<exp1> and <exp2> were recognized themselves, their root and
leaf addresses were pushed on the root and leaf stacks, so at
this point, the top elements on the root and leaf stacks are the
root and leaf addresses of <exp2>, and the second-from-top are
the root and leaf addresses of <exp1>. We have to pop these
addresses, link <exp1> to <exp2>, and push the root and leaf
addresses of the linked subgraph representing <exp1> '<' <exp2>.

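case <precedence>:
    temp = pop_leaf ();        /* leaf of <exp2> */
    last = pop_leaf ();        /* leaf of <exp1> */
    next = pop_root ();        /* root of <exp2> */
    last -> fwd_ptr = next;    /* link <exp1> to <exp2> */
    next -> parent = last;
    push_leaf (temp);          /* leaf of the linked subgraph */
    break;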
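
The edge and precedence cases can be exercised together in a small self-contained C sketch. The fixed-size arrays, the node layout, and the function names reduce_edge and reduce_precedence are assumptions made for illustration; the fragments in the text give only the bodies of the parser actions.

```c
#include <assert.h>
#include <stdlib.h>

#define STACK_MAX 64

struct node {
    const char *label;
    struct node *fwd_ptr;   /* successor in the precedence graph */
    struct node *parent;    /* predecessor in the precedence graph */
};

static struct node *root_stack[STACK_MAX];
static struct node *leaf_stack[STACK_MAX];
static int root_top = 0;
static int leaf_top = 0;

static void push_root(struct node *n)
{ assert(root_top < STACK_MAX); root_stack[root_top++] = n; }

static void push_leaf(struct node *n)
{ assert(leaf_top < STACK_MAX); leaf_stack[leaf_top++] = n; }

static struct node *pop_root(void)
{ assert(root_top > 0); return root_stack[--root_top]; }

static struct node *pop_leaf(void)
{ assert(leaf_top > 0); return leaf_stack[--leaf_top]; }

static struct node *create_node(const char *label)
{
    struct node *n = calloc(1, sizeof *n);  /* pointers start as NULL */
    assert(n != NULL);
    n->label = label;
    return n;
}

/* <exp> ::= <edge> : a single node is both root and leaf. */
struct node *reduce_edge(const char *label)
{
    struct node *n = create_node(label);
    push_root(n);
    push_leaf(n);
    return n;
}

/* <exp> ::= <exp1> '<' <exp2> : link leaf of <exp1> to root of <exp2>.
   The root of <exp1> stays on the root stack as the combined root. */
void reduce_precedence(void)
{
    struct node *temp = pop_leaf();     /* leaf of <exp2> */
    struct node *last = pop_leaf();     /* leaf of <exp1> */
    struct node *next = pop_root();     /* root of <exp2> */
    last->fwd_ptr = next;
    next->parent = last;
    push_leaf(temp);
}
```

Parsing e1 < e2 < e3 calls reduce_edge twice, then reduce_precedence, then reduce_edge and reduce_precedence again, leaving one root (e1) and one leaf (e3) on the stacks.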
•  Concurrency - <exp> ::= <exp1> '^' <exp2>

An operator such as '^' requires more complicated code generation
than a conventional binary operator such as '+'. A '^' operator takes a
task graph subgraph as each of its left- and right-hand operands, but
these two subgraphs must then be terminated by a 'v' node which
unites their two lower-most leaves. The expression tree of a
language with conventional binary operators is a tree; that of a
task graph is a double tree. For every '^' which splits flow of
control into two parallel streams, there is a later, matching
'v' node which unites these two streams again. For instance,
the expression

(e1 < e2 < e6) ^ (e3 < e7) ^ e4 ^ e5

will generate the following subgraph:


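[Figure: subgraph for the expression above. The chains e1, e2, e6 and e3, e7
and the single edges e4 and e5 run in parallel below '^' nodes, and each '^'
node is closed by a matching 'v' node.]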

The essential point is that each opening par-begin '^' node
(parallelism-begin) must be matched by a closing par-end 'v'
node. It follows that, as well as stacking the addresses of
the roots of expression subgraphs as they are formed, we also
have to save the addresses of the leaf nodes of these subgraphs,
so that we can later terminate pairs of leaves with a 'v' node.
We therefore employ two stacks in precedence graph building, an
expression root stack and a leaf stack.
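
The text gives no code for the concurrency case, so the following C sketch is only one way the two stacks might be used for it. Everything here is an assumption made for illustration: the second forward pointer fwd2_ptr, the function name reduce_concurrency, and the choice to push the '^' node as the new root and the 'v' node as the new leaf.

```c
#include <assert.h>
#include <stdlib.h>

#define STACK_MAX 64

struct node {
    const char *label;
    struct node *fwd_ptr;    /* first (or only) successor */
    struct node *fwd2_ptr;   /* second successor, used by '^' nodes */
};

static struct node *root_stack[STACK_MAX];
static struct node *leaf_stack[STACK_MAX];
static int root_top = 0;
static int leaf_top = 0;

static void push_root(struct node *n)
{ assert(root_top < STACK_MAX); root_stack[root_top++] = n; }

static void push_leaf(struct node *n)
{ assert(leaf_top < STACK_MAX); leaf_stack[leaf_top++] = n; }

static struct node *pop_root(void)
{ assert(root_top > 0); return root_stack[--root_top]; }

static struct node *pop_leaf(void)
{ assert(leaf_top > 0); return leaf_stack[--leaf_top]; }

static struct node *create_node(const char *label)
{
    struct node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->label = label;
    return n;
}

/* <exp> ::= <edge> : a single node is both root and leaf. */
struct node *reduce_edge(const char *label)
{
    struct node *n = create_node(label);
    push_root(n);
    push_leaf(n);
    return n;
}

/* <exp> ::= <exp1> '^' <exp2> : open with '^', close with 'v'. */
void reduce_concurrency(void)
{
    struct node *right_leaf = pop_leaf();   /* leaf of <exp2> */
    struct node *left_leaf  = pop_leaf();   /* leaf of <exp1> */
    struct node *right_root = pop_root();   /* root of <exp2> */
    struct node *left_root  = pop_root();   /* root of <exp1> */
    struct node *begin = create_node("^");
    struct node *end   = create_node("v");

    begin->fwd_ptr  = left_root;            /* split into two streams */
    begin->fwd2_ptr = right_root;
    left_leaf->fwd_ptr  = end;              /* unite the two leaves */
    right_leaf->fwd_ptr = end;
    push_root(begin);                       /* subgraph has one entry */
    push_leaf(end);                         /* and one exit */
}
```

Because the result again has a single root and a single leaf, it can serve as an operand of a later '<' or '^'.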
•  Choice - <choice> ::= 'if' <exp> 'then' <stmt> 'else' <stmt>

An analogous problem is encountered in building a representation
of an

if P then S1 else S2;

This is solved using a similar double-stacking algorithm. In
the case of 'if', however, we have to handle three expression
subgraphs: those for the 'if' condition, the true, and the false
exit subgraphs from the 'if'. Each of these three subgraphs is
stacked as it is created, and then, upon recognition of

<choice> ::= 'if' <expression> 'then' <stmt> |
             'if' <expression> 'then' <stmt> 'else' <stmt>

the subgraphs are de-stacked in reverse order (false, true,
condition), and attached to the 'if' node. Finally, the terminating
'fi' node has to have the output leaves of the true and
false 'if' exit subtrees connected to it, and so these two leaf
addresses are de-stacked from the leaf stack. The code for
handling 'if' and 'fi' nodes is:

case <choice>:
    node = new_node ("if");
    node -> false_exit = pop_root ();
    node -> true_exit = pop_root ();
    node -> condition = pop_root ();
    (node -> true_exit) -> parent = node;
    (node -> false_exit) -> parent = node;
    push_root (node);
    node = new_node ("fi");
    false = pop_leaf ();
    true = pop_leaf ();
    false -> fwd_ptr = node;
    true -> fwd_ptr = node;
    node -> falseparent = false;
    node -> trueparent = true;
    push_leaf (node);
    break;
•  Iteration

The code for <iteration> is an obvious variation on that for
<choice>. The true-exit subgraph connected to the 'if' node has
its leaf node connected back again to the 'if' node, while the
false-exit points to the next subgraph in the precedence graph
after the while-do.

case <while>:
    node = new_node ("if");
    node -> false_exit = 0;
    node -> true_exit = pop_root ();
    (node -> true_exit) -> parent = node;
    node -> condition = pop_root ();
    (node -> condition) -> parent = node;
    push_root (node);
    true = pop_leaf ();
    true -> fwd_ptr = node;    /* loop back to the 'if' node */
    push_leaf (node);
    break;
APPENDIX C

THE DESIGN OF A PROGRAMMING LANGUAGE BASED
ON COMMUNICATION NETWORKS

Arthur B. Maccabe
Richard J. Leblanc

C.1 INTRODUCTION

The design and implementation of message-oriented programming languages
has recently become an active area of research. The increased activity in
this area is due, in part, to the increased interest in distributed processing
systems. Message-oriented languages structure programs as collections of
processes that communicate and cooperate using message transmission
primitives. Distributed processing systems can be viewed as collections of
processor nodes with independent address spaces that communicate and cooperate
by exchanging messages over communication channels. Hence, there is a natural
mapping of the units of a program written in a message-oriented language to
the resources of a distributed processing system.

This paper presents the design of a new message-oriented language,
PRONET (Processes and Networks). The goals of PRONET are to provide a high
degree of process independence and a mechanism for describing process
hierarchies, while obtaining information about inter-process relationships which
will aid in effective program execution. In most message-oriented languages,
relationships between processes must be expressed in the descriptions of the
individual processes. PRONET has been developed to investigate the separation
of inter-process relationships from the description of processes. This
separation is expected to enhance process independence while isolating
information which will aid in the distribution of processes. The initial
design of PRONET concentrates on inter-process relationships that describe
structural aspects of the communication environment used by processes. To
this end, PRONET provides powerful features for describing the instantiation
and dynamic reconfiguration of "communication networks."

This paper was published in the Proceedings of the 3rd Int. Conf. on
Distributed Computing Systems, Miami, Florida, October 1982.

C.1.1 Programming Environments

PRONET  was  motivated  by  a  perceived  need  to  aid  application  programmers 
in  their  efforts  to  use  the  programming  environment  presented  by  a  distributed 
processing system [Fors81]. Our view of distributed processing systems is
based on the definition presented by Enslow [Ensl78] and subsequently refined
by Enslow and Saponas [Ensl81]. To distinguish the systems meeting the
criteria  of  this  definition  from  other  distributed  processing  systems,  they 
have  been  termed  "Fully  Distributed  Processing  Systems"  (FDPS). 

For  the  purposes  of  this  paper,  an  FDPS  is  a  collection  of  loosely 
coupled  processors  that  function  in  a  cooperatively  autonomous  fashion  to 
provide  services  ([Ensl78],  [Clar80]).  The  processors  are  autonomous  in  that 
their  activities  are  entirely  controlled  by  local  decision-making  criteria. 
To  avoid  total  anarchy,  the  decision-making  criteria  of  each  processor  are 
integrated  with  the  goal  of  cooperation.  This  cooperation  is  represented 
within  each  processor  by  a  component  of  the  network  operating  system  (NOS). 
The  primary  function  of  the  NOS  is  to  provide  a  unified  view  of  the  resources 
available  in  an  FDPS.  It  performs  this  task  by  imposing  a  layer  of  control 
above  the  processors  which  recognizes  and  respects  the  autonomy  of  the 
individual  processors.  The  assumed  existence  of  an  NOS  appears  to  distinguish 
the  environment  we  anticipate  from  that  assumed  by  other  researchers.  For 
example,  upon  request  the  NOS  will  provide  scheduling  and  allocation  functions 
based  on  its  global  view  of  the  network.  Thus,  when  a  new  process  is  created, 
a  program  can  use  the  NOS  to  determine  an  appropriate  physical  location  for 
the  process. 

C.1.2 Logical Communication Networks

Programs written in message-oriented languages may be viewed as specifying
"communication networks". The nodes of these networks are the processes
defined or used by the program, while the arcs between nodes represent
communication links. These communication links are directed and may be used in
the transmission of any number of messages. Note that this definition of
"communication network" concentrates on connectivity; other definitions are
possible. For instance, the task graphs of [Live80] reflect a definition that
concentrates on communication sequencing.

Languages that support dynamic reconfiguration of communication networks
typically do so by allowing processes to create new processes and/or allowing
processes to pass the names of processes (or ports) to other processes
([Kahn77], [Feld79], [Lisk79], [Hewi79]). Because the activities that control
dynamic reconfiguration of the communication network are intermixed with the
activities of individual processes, they are not readily available for
examination by an operating system or a person attempting to understand the
program.

Other  programming  languages/systems  that  support  a  similar  separation 
(UNIX  [Bour78],  Mesa  [Mitc79],  Task  Forces  [Jone79]  and  PCL  [Less79])  do  not 
enforce  a  complete  separation.  In  each  of  these  languages/systems  a  process 
may  specify  the  creation  of  new  processes  in  its  description.  Thus,  while  an 
abstract  view  of  the  communication  environment  is  available,  neither  the 
operating system nor a person reasoning about the program may rely on the
completeness of this view. In PRONET, the conditions and activities associated
with  any  structural  modification  of  the  communication  environment  (including 
process  creation)  must  be  stated  in  a  network  specification. 

C.2 THE BASIC FEATURES OF PRONET

PRONET  is  composed  of  two  complementary  sublanguages:  a  network 
specification  language,  NETSLA,  and  a  process  description  language,  ALSTEN. 
Programs  written  in  PRONET  are  composed  of  network  specifications  and  process 
descriptions.  Network  specifications  initiate  process  executions  and  oversee 
the  operations  of  the  processes  they  have  initiated.  The  overseeing  capacity 
of  network  specifications  is  limited  to  the  maintenance  of  a  communication 
environment  for  a  collection  of  related  processes.  The  processes  initiated  by 
a  network  specification  can  be  simple  processes,  in  which  case  the  activities 
of  the  processes  are  described  by  ALSTEN  programs,  or  they  can  be  "composite 
processes",  in  which  case  their  activities  are  described  by  a  "lower-level" 
network  specification. 

ALSTEN  is  an  extension  of  Pascal  which  enables  programmers  to  describe 
the  activities  of  sequential  processes.  During  their  execution,  processes  may 
perform  operations  that  cause  events  to  be  announced  in  their  overseeing 
network  specification.  Network  specifications,  written  in  NETSLA,  describe 
the  activities  that  are  performed  when  an  executing  process  'announces'  an 
event. This chapter describes the mechanisms that enable processes to
announce events and the network-level activities that can be performed in
handling an announced event. Two principles have influenced the design of these
features: independence of process descriptions and distributed execution of
network specifications.

C.2.1 The Features of ALSTEN

Pascal was chosen as a basis for process descriptions because of its
simplicity, its strong type checking and the availability of an extendable
compiler. ALSTEN is, for the most part, an extension of Pascal with the
exception of the 'file' interface provided by Pascal. Process descriptions
written in ALSTEN communicate with their surrounding environment primarily
through locally declared ports which are visible to their overseeing network
specification. Hence, the Pascal 'file' interface has been replaced by port
declarations and message transmission operations in ALSTEN.

This section describes the ALSTEN features associated with message
transmission and process-defined events. Each message transmission initiated
by a process causes an event to be announced in the network specification
which oversees the operations of the process. In handling this event, the
overseeing network specification determines where the message is to be
delivered and how the communication environment being maintained is to be
altered as a result of transmitting the message. While this provides a
powerful mechanism for dynamic reconfiguration of logical communication networks
and maintains a high degree of independence in process descriptions, a more
flexible mechanism for transmitting information from an executing process to
its overseeing network specification is often useful. In ALSTEN, this
mechanism is provided by event declarations and an 'announce' operation.

C.2.1.1  Message  Transmission  Operations 

Message transmission is the primary mechanism by which executing processes
communicate with the other objects in their environment (their overseeing
network specification and other processes). The basic message transmission
operations of ALSTEN are 'send' and 'receive'. Both operations are specified
"in-line", as are the 'read' and 'write' operations of Pascal (and in contrast
to the 'interrupt handling' receive of Mininet [Live80]).

The send operation of ALSTEN is best classified as a 'buffering' operation
with partial 'blocking'. When a process executes a send operation, its
(logical) execution is blocked until all events caused by the message
transmission are handled. Handling a message transmission event may involve an
alteration of the logical network which oversees the execution of the process
sending the message and delivery of the message to any number of 'receiving'
processes. Message 'delivery' does not require that the 'receiving' process
perform a receive operation, but does block execution of the sending process
until the message value has been copied into the 'IPC space' of the
'receiving' process. The address space of all ALSTEN processes is partitioned
into an 'IPC space' and a 'manipulation space'. The 'IPC space' consists of
queues of messages which have been delivered to the process but have not been
'received'. The 'manipulation space' of a process contains the values of
variables which are local to the process.
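
The split between the two spaces can be illustrated with a small C sketch. It is not ALSTEN's implementation: the fixed-capacity queue, the string-valued messages, and the names ipc_queue, ipc_deliver and ipc_receive are assumptions, and blocking is modelled simply by returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define QUEUE_CAP 8     /* assumed capacity of one IPC-space queue */
#define MSG_MAX   32    /* assumed maximum message size in bytes */

/* One queue in the 'IPC space' of a receiving process. */
struct ipc_queue {
    char slots[QUEUE_CAP][MSG_MAX];
    int head;           /* index of the oldest delivered message */
    int count;          /* number of delivered, not yet received */
};

/* Delivery: copy the message value into the receiver's IPC space.
   Returns false if the queue is full (the sender would stay blocked). */
bool ipc_deliver(struct ipc_queue *q, const char *msg)
{
    int tail;

    assert(q != NULL && msg != NULL);
    if (q->count == QUEUE_CAP)
        return false;
    tail = (q->head + q->count) % QUEUE_CAP;
    strncpy(q->slots[tail], msg, MSG_MAX - 1);
    q->slots[tail][MSG_MAX - 1] = '\0';
    q->count++;
    return true;
}

/* Receive: move the oldest message from the IPC space into a local
   variable (the 'manipulation space'); var must hold MSG_MAX bytes.
   Returns false if nothing has been delivered (the receiver would wait). */
bool ipc_receive(struct ipc_queue *q, char *var)
{
    assert(q != NULL && var != NULL);
    if (q->count == 0)
        return false;
    strcpy(var, q->slots[q->head]);
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return true;
}
```

Note that ipc_deliver succeeds without any matching ipc_receive, which mirrors the statement that delivery does not require the receiving process to perform a receive operation.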

To receive a message, a process must wait until an acceptable message is
available in its 'IPC space'. When it has been completed, the receive operation
of ALSTEN has the effect of transferring the message received from the
'IPC space' of the receiving process to its 'manipulation space'.

Executions of the send and receive operations of ALSTEN are specified by
send and receive statements. The syntax of these statements is shown in
Figure 1. These statements are introduced into the grammar of Pascal [Jens74]
as new variations of the 'simple statement'.

<send stmt> ::=
    send [<expr>] <bound port denoter>
<receive stmt> ::= <simple receive> | <conditional receive>
<simple receive> ::=
    receive [<variable>] from <free port denoter>
<conditional receive> ::=
    when {<receive part>} [<otherwise part>] end
<receive part> ::= <simple receive> [do <stmt>]
<otherwise part> ::= otherwise <stmt>

Figure 1. Send and Receive Statements in ALSTEN

The 'send stmt' causes the value of the expression to be transmitted
through the output port identified by the 'bound port denoter'. The 'simple
receive' causes the 'variable' to be assigned the value of the next message to
be received from the port identified by the 'free port denoter'. If any of
the simple receives in a 'conditional receive' can succeed immediately, one is
chosen arbitrarily and the statement following the corresponding 'do' is
executed. Otherwise, when there is no 'otherwise part', the execution of the
process is blocked until one of the receive statements can succeed. If none
of the receive statements can succeed immediately and there is an 'otherwise
part', the statement following the otherwise is executed. This control
structure presents a restricted form of the Ada select [DOD80].
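
The selection rule of the conditional receive can be sketched in C. This is an illustration under stated assumptions rather than ALSTEN's implementation: each port holds at most one integer message, the "arbitrary" choice is taken to be the first ready port, blocking is reported through a return code, and the names port, conditional_receive, RECV_OTHERWISE and RECV_WOULD_BLOCK are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { RECV_OTHERWISE = -1, RECV_WOULD_BLOCK = -2 };

/* A port holding at most one pending integer message (assumed). */
struct port {
    bool has_message;
    int  message;
};

/* Try each simple receive in turn. Returns the index of the port that
   was received from (the caller then runs the matching 'do' statement),
   RECV_OTHERWISE if no receive can succeed and an otherwise part exists,
   or RECV_WOULD_BLOCK if the process would have to wait. */
int conditional_receive(struct port *ports, int nports,
                        bool has_otherwise, int *variable)
{
    int i;

    assert(ports != NULL && variable != NULL);
    for (i = 0; i < nports; i++) {
        if (ports[i].has_message) {
            *variable = ports[i].message;   /* assign to the variable */
            ports[i].has_message = false;   /* message is consumed */
            return i;
        }
    }
    return has_otherwise ? RECV_OTHERWISE : RECV_WOULD_BLOCK;
}
```

Picking the first ready port is only one admissible policy, since the text leaves the choice among ready receives arbitrary.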
C.2.1.2 Ports for Message Transmission

To emphasize independence of process descriptions, message transmission
operations are issued to locally declared 'ports'. The ports of a process
description are visible to network specifications that create instances of
processes which execute the process description. Simple ports are declared
with a direction ('in' or 'out') and an associated message type. The
associated message type defines how messages transmitted through a port are to
be interpreted. A message type can be a 'signal' (only control information is
transmitted) or any data type which does not contain pointer or file
components.

The  notion  of  'server'  processes  has  had  a  significant  impact  on  the 
design  of  the  message  transmission  features  of  PRONET.  Server  processes  are 
characterized  by  two  properties:  first,  a  server  process  must  respond  to 
requests  from  an  unknown  number  of  'user'  processes  and,  second,  it  must 
ensure  that  each  response  is  directed  toward  the  process  that  generated  the 
corresponding request. When using server processes and user processes in
different  programs,  it  may  be  necessary  to  impose  'intermediary'  processes  on 
one  or  more  of  the  communication  paths  between  a  server  and  a  user.  An 
intermediary  may  mediate  between  differing  message  formats  or  communication 
protocols. 

The  ALSTEN  features  related  to  the  description  of  server  processes  are 
'port  groups',  'port  sets'  and  'port  tags'.  Port  groups  provide  a  means  for 
collecting  a  number  of  simple  ports  into  a  single  bundle.  A  'bidirectional 
port'  would  be  a  port  group  containing  two  simple  ports,  one  input  and  one 
output, each with an associated message type. Port sets, on the other hand,
are used to denote collections of identical ports (either simple ports or port
groups). Port sets provide server processes with a mechanism for communicating
with an unknown number of user processes. Each element in a port set is
assumed to be associated with a unique user; if the port set is a collection
of port groups, the simple ports in each port group may be connected directly
to the user or to intermediaries. In order that a server may restrict its
communications  to  a  particular  user,  we  introduce  port  tag  variables.  Port 
tag  variables  are  declared  to  range  over  the  members  of  a  single  port  set. 
The  value  of  a  port  tag  variable  can  be  set  in  a  receive  statement  and  may  be 
used  in  subsequent  send  and  receive  statements. 

The syntax for declaring ports and port tag variables in ALSTEN is shown
in Figure 2. Port declarations appear in the header of a process description
and hence, the definition of any 'msg type' must appear outside of the process
description (unless it is a standard type; e.g., integer, real, signal, etc.).
This is necessary as these definitions must be shared by other processes (that
either send or receive messages of type 'msg type') and any network
specification that oversees the operations of processes executing the process
description. The nonterminal <port tag type> is introduced as a new 'type' in
the syntax presented in [Jens74].

<port decl> ::= <simple port decl> | <port group decl>

<simple port decl> ::=
    port [set] <port id> <direction> <msg type>

<port id> ::= <id>

<direction> ::= in | out

<msg type> ::= <type id>

<port group decl> ::=
    port [set] <port id> '(' <subport list> ')'

<subport list> ::= <subport decl> {';' <subport decl>}

<subport decl> ::= <subport id> <direction> <msg type>

<subport id> ::= <id>

<port tag type> ::= tag of <port id>

Figure 2. Port and Port Tag Declarations in ALSTEN

Figure 3 presents the ALSTEN syntax for denoting port instances in send
and receive (and announce) operations. A 'bound port denoter' whose 'simple
port denoter' identifies a 'port set' must contain a 'use tag part' to
identify the specific instance of the port set being denoted. Recalling the
syntax of the send operation presented in Figure 1, the message type of the
port denoted in a send operation must be "name equivalence" compatible with
the type of the expression being transmitted (if the message type is 'signal'
no expression can be present). A similar restriction holds for receive
operations.

<bound port denoter> ::= <simple port denoter>
    | <simple port denoter> <use tag part>

<simple port denoter> ::= <port id>
    | <port id> '.' <subport id>

<use tag part> ::= use <port tag variable>

<port tag variable> ::= <variable>

<free port denoter> ::= <bound port denoter>
    | <simple port denoter> <set tag part>

<set tag part> ::= set <port tag variable>

Figure 3. Denoting Ports in ALSTEN

The use of port sets, port groups and port tag variables is illustrated
in Figure 4 which presents the description of a simple server process. This
process implements a shared sequence of numbers. In line 2 a port set,
'user', is declared. The elements of this port set are instances of a port
group containing an input port 'req' and an output port 'rsp'. Lines 10 and
11 illustrate the setting and subsequent use of the port tag variable
'user_tag' (declared on line 4). In line 10, the value of 'user_tag' is set
to indicate which instance of the port set 'user' the request is
received from. The value of 'user_tag' is used in line 11 to direct the
response to the element in the set 'user' from which the request was received.

 1  process script shared_sequence
 2    port set user (req in signal; rsp out integer);
 3  var
 4    user_tag : tag of user;
 5    sequence_val : integer;
 6  begin
 7    sequence_val := 0;
 8    while true do
 9    begin
10      receive from user.req set user_tag;
11      send sequence_val to user.rsp use user_tag;
12      sequence_val := sequence_val + 1
13    end (* while *)
14  end (* shared sequence *)

Figure 4. A Simple Server Process
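
The behaviour of the Figure 4 server can be approximated outside ALSTEN. The following Python sketch is only an illustration under stated assumptions: the class name `PortSetServer`, the method names, and the use of dictionary keys as port tags are hypothetical, and the choice of the first non-empty request queue stands in for whatever selection rule the ALSTEN run-time actually applies.

```python
from collections import deque


class PortSetServer:
    """Toy model of the Figure 4 server: one (req, rsp) port group per user."""

    def __init__(self):
        self.req = {}          # tag -> pending request signals ('in' subport)
        self.rsp = {}          # tag -> responses sent so far ('out' subport)
        self.sequence_val = 0

    def add_user(self, tag):
        # Creates one element of the port set 'user', identified by `tag`.
        self.req[tag] = deque()
        self.rsp[tag] = deque()

    def step(self):
        """One loop iteration of the server (lines 10-12 of Figure 4)."""
        for tag, queue in self.req.items():
            if queue:
                queue.popleft()                          # receive ... set user_tag
                self.rsp[tag].append(self.sequence_val)  # send ... use user_tag
                self.sequence_val += 1
                return tag
        return None  # assumption: a real receive would block here instead
```

The tag returned by `step` plays the role of 'user_tag': it records which element of the port set supplied the request, so the response goes back to that same element.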
C.2.1.3  Process-Defined  Events 

As has been discussed, the execution of a send operation causes a message
transmission event to be announced in the network specification which
oversees the operation of the process which executes the send operation.
Thus, the transmission of a message may lead to a reconfiguration of the
communication environment used by the sending process. This is particularly
useful in providing a mechanism for dynamic reconfiguration of logical
communication networks while maintaining a high degree of independence in
process descriptions. However, a more flexible interface between processes
and their overseeing network specifications which allows processes to indicate
significant changes in their state or possible errors in communications is
often useful. In ALSTEN, this interface is provided by event declarations and
the announce statement.

Process descriptions may declare event names and subsequently 'announce'
these events during their execution. The activities to be performed (if any)
when an executing process announces an event are described in the network
specification which oversees the execution of the process. Hence, this
mechanism provides a flexible interface between the process-level and the
network-level while maintaining the separation of these levels.

Event  declarations  have  the  form: 

<event decl> ::=
    event <event name> [about <port id>]

Event declarations appear in the header of a process description and follow
the port declarations of the process description. The event name is an
identifier which can be used in subsequent announce operations. The optional
'about part' allows the process to associate an event with a set of ports.
This is useful in indicating erroneous communication (either protocol or
consistency) on a specific port.

The announce operation of ALSTEN is introduced as a statement (a 'simple
statement' in the grammar of Pascal [Jens74]):

<announce stmt> ::= announce <event name>
    [about <bound port denoter>]

The 'event name' must be the name of a declared event. Further, if this event
has been declared with an associated port set, the about clause must be
present and must denote an instance of the associated port set.

An example of process-defined events is presented in Figures 5 and 6.
Figure 6 presents the script for instances of 'mailbox' processes. The types
used in the mailbox process script are shown in Figure 5. In this case, the
event 'mailbox_empty' (declared in line 5 of Figure 6 and announced in line
24) is used to indicate a significant change in the internal state of the
process.

1  type
2    letter = array [1..120] of char;
3    user_rsp_kinds = (empty, mail_item);
4    user_rsp = record
5      case kind : user_rsp_kinds of
6        empty : ();
7        mail_item : (let : letter)
8    end; (* user_rsp *)

Figure 5. Mailbox Process Script Type Definitions

 1  process script mailbox
 2    port input in letter;
 3    port output out letter;
 4    port control in signal;
 5    event mailbox_empty;
 6  var
 7    next_rsp : user_rsp;
 8    done : boolean;
 9  begin
10    repeat
11      receive from control;
12      next_rsp.kind := mail_item;
13      done := false;
14      repeat
15        when
16          receive next_rsp.let from input do
17            send next_rsp to output;
18        otherwise
19          done := true
20        end (* when *)
21      until done;
22      next_rsp.kind := empty;
23      send next_rsp to output;
24      announce mailbox_empty
25    until false
26  end (* mailbox script *)

Figure 6. The Mailbox Process Script

This process script implements a simple mailbox which acts as a
repository for 'letters'. Responses from the mailbox will be of type
'user_rsp' which is defined in lines 3-8 of Figure 5. Upon reception of a
signal on its 'control' port (line 11 of Figure 6), the mailbox forwards the
letters on its 'input' port to its 'output' port (lines 16 and 17). When there
are no letters remaining on the mailbox input port, the process sends an
'empty' message to its output port and announces the event 'mailbox_empty'
(lines 22-24). The process then cycles to wait for the next request to
deliver its 'contents' (line 11).

The structure of this mailbox is natural considering the semantics of
the ALSTEN send operation. Because senders do not wait until their messages
are received, there is no need for the mailbox to receive messages as they are
sent. Hence, the mailbox does not maintain an internal representation of its
contents but rather, relies on the run-time support environment to maintain
collections of letters. A simple mail system that uses this mailbox process
script and illustrates the handling of process-defined events will be
presented in the next section.
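
The inner loop of the mailbox script relies on the conditional receive with an 'otherwise part', which never blocks. A rough Python analogue is sketched below; it is not ALSTEN semantics in full. The function name `deliver`, the use of a deque for the buffered 'input' port, and the tuples standing in for 'user_rsp' values are all assumptions made for illustration.

```python
from collections import deque


def deliver(input_port, output_port, events):
    """One pass of the mailbox loop body after a signal arrives on 'control'."""
    done = False
    while not done:
        if input_port:
            # 'when receive next_rsp.let from input do send next_rsp to output'
            letter = input_port.popleft()
            output_port.append(("mail_item", letter))
        else:
            # 'otherwise' branch: no letter can be received immediately
            done = True
    output_port.append(("empty", None))   # lines 22-23 of Figure 6
    events.append("mailbox_empty")        # line 24: announce mailbox_empty
```

Because the letters wait in the buffered input port until the control signal arrives, the sketch, like the script, keeps no internal list of its own contents.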

C.2.2 The Features of NETSLA

The features of NETSLA are aimed at specifying the initial configuration
and subsequent modifications of a communication environment. The overriding
principle followed in the design of these features is that of "centralized
expression, decentralized execution" [Live80]. Centralized expression is
important in presenting the abstraction to be supported by network
specifications. All of the inter-process relationships that describe a
communication environment appear in a single network specification. However,
this communication environment is not maintained in a centralized fashion.
Processes maintain their communication environment indirectly. When they
execute send or announce operations, processes perform the activities
specified by their overseeing network specifications; however, the nature of
these activities is unknown to the process.

C.2.2.1 An Overview of Network Specifications

The syntax for specifying a network is shown in Figure 7. Like the
header of an ALSTEN process script, a network header can contain port and
event declarations. Network specifications that do declare ports and/or
events will be used as "composite processes" in higher-level network
specifications.

<network specification> ::= <network header>
    {<process class specification>}
    {<event handling clause>}
    [<initialization clause>]
    end <identifier>

<network header> ::= network <net id> ';'
    {<port decl>} {<event decl>}

<process class specification> ::=
    process class <process id>
    [<process attributes>]
    {<port decl>}
    {<event decl>}
    end <process id>

<process attributes> ::= attributes
    <field list> end attributes

Figure 7. Network Specifications in NETSLA

The process class specifications contained in a network specification
capture those portions of a process description that are visible in a network
specification (its name, port declarations and event declarations) and a
'process attributes' part. The name, port declarations and event declarations
stated in a process class specification are a reiteration of the process script
or network specification header which is used to implement instances of the
process class. Process attributes are used to identify the characteristics
associated with instances (processes) in a process class. The implementation
of a process class may be a network specification (in which case instances of
the process class are actually "composite processes") or a process script
written in ALSTEN. In either case, this implementation is not contained in
the network specification. Process implementations are compiled separately
and compatibility between specification and implementation is checked in a
pre-linkage phase. The remaining portions of a network specification, the
event handling clauses and the optional initialization clause, describe the
instantiation and subsequent modifications of the logical communication
network which is maintained by the network specification.

When a logical network is instantiated, its initialization clause is
elaborated. This initialization clause is used to create a collection of
processes and delineate communication paths between them. A simple network
specification is illustrated in Figure 8. One process class, 'proc_class'
(lines 2-5), is used in this network specification. Instantiation of the
logical communication network is specified in lines 6-14 and involves the
creation of three processes (lines 7-9) and the establishment of communication
paths between them (lines 10-13). The statement 'connect proc1.output to
proc3.input' (line 10) specifies that the messages sent to the output port of
'proc1' are to be transmitted to the input port of 'proc3'. A graphical
representation of the logical communication network established by this
network specification is shown in Figure 9.

 1  network static_net
 2    process class proc_class
 3      port input in integer;
 4      port output out integer;
 5    end proc_class
 6    initial
 7      create proc1 : proc_class;
 8      create proc2 : proc_class;
 9      create proc3 : proc_class;
10      connect proc1.output to proc3.input;
11      connect proc2.output to proc3.input;
12      connect proc3.output to proc1.input;
13      connect proc3.output to proc2.input;
14  end static_net

Figure 8. A Simple Network Specification

Figure 9. Graphical Representation of the Simple Network

This simple example illustrates two of the simple activities that can be
performed in a network specification, creation and connection. As
illustrated, creation involves the binding of a name to each process instance
as it is created. In NETSLA these and other name bindings are limited to the
clause in which they appear. Hence, the names 'proc1', 'proc2' and 'proc3'
may be used throughout the initialization clause but would not be usable in
other clauses unless they were explicitly bound to objects (process or port
instances) in these clauses. Connection is shown in lines 10-13. One-to-one,
many-to-one (messages are ordered by time of arrival) and one-to-many
(messages are replicated) connections between ports can be specified. To be
connected, ports must be compatible in both message type and direction. Message
type compatibility, like type compatibility in Pascal, is based on name
equivalence of types. The definition of port direction compatibility has two
components:

i)  If  one  port  is  a  network-level  port  (declared  in  the  network  header)  and 
the  other  is  a  process-level  port  (declared  in  a  process  class 
specification),  the  ports  must  have  similar  directions, 
ii)  If  both  ports  are  process-level  ports  or  both  are  network-level  ports, 
the  ports  must  have  opposite  directions. 
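
The two compatibility rules can be expressed as a small predicate. The sketch below is illustrative only: the `Port` record and the `compatible` function are hypothetical names, type names are compared as plain strings to imitate name equivalence, and "similar directions" in rule (i) is read as "the same direction".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    msg_type: str         # type name; compatibility is by name, not structure
    direction: str        # "in" or "out"
    network_level: bool   # True if declared in the network header


def compatible(a: Port, b: Port) -> bool:
    """Return True if ports a and b may be connected under rules (i) and (ii)."""
    if a.msg_type != b.msg_type:
        return False                         # message types must match by name
    if a.network_level != b.network_level:
        return a.direction == b.direction    # rule (i): mixed levels
    return a.direction != b.direction        # rule (ii): same level
```

Under this reading, a network-level 'in' port may feed a process-level 'in' port, while two process-level ports connect only as an 'out' and 'in' pair.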


C.2.2.2 Event Handling

The  initialization  clause  is  sufficient  for  the  description  of  static 
networks.  However,  other  features  are  needed  to  describe  dynamically  changing 
communication  environments.  In  PRONET,  these  features  are  based  on  the  notion 
of  network  events.  During  their  execution,  processes  may  perform  operations 
which announce events to their overseeing network specification (using send or
announce). NETSLA provides two mechanisms for handling announced events in
network specifications: connections and 'event handling clauses'.

Connections are one mechanism for handling the event associated with
message transmission on a port. When a connection between two ports has been
established, this 'message transmission' event is handled by transferring the
message from the sending port to the receiving port. The connection mechanism
is distinct from the event clause mechanism in three ways: connections can be
established or broken dynamically, the activities of this mechanism are
defined by the language and connections can only be used to handle the
'message transmission' event.

Event  handling  clauses  are  more  flexible  in  the  types  of  events  they  can 
handle  and  the  activities  they  can  specify  but  are  established  statically  and 
cannot  be  'broken'.  Event  handling  clauses  provide  a  capability  to  specify 
the  activities  that  are  to  be  performed  when  a  message  is  transmitted  (if  a 
simple  connection  is  not  sufficient),  when  a  process  defined  event  is 
announced,  when  an  element  of  a  network  declared  port  set  is  created  or  when 
an  element  of  a  network  declared  port  set  is  removed.  The  syntax  of  the  event 
handling  clauses  of  NETSLA  is  illustrated  in  Figure  10. 

<event handling clause> ::= <arrive clause>
    | <enter clause> | <leave clause> | <when clause>

<arrive clause> ::= <arrive clause header>
    <activity list> end arrive

<arrive clause header> ::= arrive [<id>] on
    <arrive port binding> [from <process binding>]

<arrive port binding> ::= [<subport id> of] <port binding>

<process binding> ::= [<id> ':'] <process class name>

<port binding> ::= [<id> ':'] <port set name>

<enter clause> ::= <enter clause header>
    <activity list> end enter

<enter clause header> ::= enter <port binding> do

<leave clause> ::= <leave clause header>
    <activity list> end leave

<leave clause header> ::= leave <port binding> do

<when clause> ::= <when clause header>
    <activity list> end when

<when clause header> ::= when <event name>
    announced by <process binding> do

<initialization clause> ::= initial <activity list>

Figure 10. NETSLA Event Handling and Initialization Clauses

The bindings in the various event clause headers are used to bind names
to the objects (message, process instance or port instance) involved in the
event being handled. For example, when clauses are used to handle the
announcement of process-defined events. As such, the following 'when clause
header' could be used in a network specification to handle the 'mailbox_empty'
event when this event is announced by a process executing the mailbox process
script shown in Figure 6.

when mailbox_empty announced by box : mailbox do

In this case, the event being handled is the process-defined event
'mailbox_empty' and the name 'box' is bound to the instance of the mailbox
process  that  announced  the  event.  When  clauses  are  also  used  to  handle  the 
standard  event  'done'.  Whenever  an  executing  process  terminates  its 
activities,  the  standard  event  'done'  is  announced  to  its  overseeing  network 
specification. 

Arrive clauses are used to handle message transfer events when simple
connections between ports are not sufficient. An arrive clause can be
associated with the arrival of a message on a network-level 'in' port, in
which case the optional 'from process binding' is not specified. An 'arrive
clause' can also be associated with the arrival of a message on an 'out' port
of a process instance, in which case the 'from process binding' identifies the
process class of interest and can be used to bind a name to the instance which
is transmitting the message. The first identifier in an 'arrive clause
header' is used to bind a name to the message value being transmitted. The
'arrive port binding' in an 'arrive clause header' identifies the port set,
binds a name to the port group instance through which the message is
transmitted and identifies the subport being used.

When an event is announced, two possibilities exist: no 'handlers'
(connections or event handling clauses) are associated with the event or at
least one 'handler' is associated with the event. In the latter situation,
the activities specified by each handler are performed on the event (in an
arbitrary order). For example, when multiple connections are established for
a port, any message transmitted through the port is replicated and delivered
along each of its connections. When no handlers are associated with an event,
its announcement has no effect on the communication environment being
maintained by the network specification. Moreover, the object (process or
overseeing network specification) that announced the event cannot determine if
the event was handled. For example, when a process sends a message to a port
that has no established connection or arrive clause, the message is removed
from the port and the sending process cannot determine that its message has
not been delivered.
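
The "every handler runs, and the announcer learns nothing" behaviour described above can be sketched as follows. This is an illustration, not NETSLA's implementation: the class `EventDispatcher` and its methods are hypothetical, handlers are ordinary Python callables, and registration order stands in for the arbitrary order the text allows.

```python
from collections import defaultdict


class EventDispatcher:
    """Toy model of a network specification reacting to announced events."""

    def __init__(self):
        self.handlers = defaultdict(list)   # event name -> list of callables

    def on(self, event, handler):
        # Associates one more handler with the named event.
        self.handlers[event].append(handler)

    def announce(self, event, source):
        # Every associated handler is performed. Nothing is returned, so the
        # announcer cannot tell whether the event was handled at all.
        for handler in self.handlers.get(event, []):
            handler(source)
```

Announcing an event with no registered handler simply falls through the loop, which mirrors the silent discarding of a message sent to an unconnected port.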

C.2.2.3 Simple Activities

<activity list> ::= <activity> {<activity>}

<activity> ::= <simple activity> | <structured activity>

<simple activity> ::= <creation> | <termination>
    | <removal> | <connection> | <disconnection>
    | <message transmission> | <value construction>
    | <event announcement> | <attribute assignment>

<creation> ::= create <process binding>
    | create <port binding> on <process instance>

<termination> ::= terminate | terminate <process instance>

<removal> ::= remove <process instance>
    | remove <port group instance>

<connection> ::= connect <port instance> to <port instance>

<disconnection> ::= disconnect <port instance>
    | disconnect <port instance> from <port instance>

<message transmission> ::=
    send <msg value> to <port instance>

<value construction> ::= construct <id> <type name>
    '[' <component assignment list> ']'

<attribute assignment> ::= <attribute denoter> ':=' <value>

<event announcement> ::= announce <event name>
    [about <port group instance>]

Figure 11. Simple Activities in NETSLA

NETSLA provides nine basic activities which can be used in initialization
and event handling clauses: creation, termination, removal, connection,
disconnection, message transmission, attribute assignment, event announcement
and value construction. The syntax used in specifying these activities in
NETSLA is shown in Figure 11.

The creation activity can be applied to a process class or a port set of
a process instance. In the first of these variations, a new instance of the
process class executing the process script or network specification associated
with the process class is instantiated. The 'process binding' part of the
creation activity is used to identify the process class and bind a name to the
newly created instance. This form of the creation activity was illustrated in
the network specification shown in Figure 8. The second variation of
this activity creates a new port group instance in a port set on a process
instance. The 'port binding' part of this variation is used to identify the
port set and bind a name to the newly created port group instance. For
example, a network specification containing the process class specification:

process class shared_sequence
    port set user (req in signal; rsp out integer);
end shared_sequence

could  contain  the  activities: 

create server : shared_sequence;
create user_port1 : user on server;

(Recall the 'shared_sequence' script of Figure 4.) The latter of these
activities creates an element of the port set 'user' on the process instance
identified by 'server'. This newly created element is bound to the name
'user_port1'.

The termination activity can be applied to a process instance or to the
entire logical network being maintained by a network specification. When this
activity is applied to a process instance, the activities of the process are
terminated and no further messages or events will be transmitted to or
received from the terminated process. When no process instance is specified
in a termination activity, the logical network maintained by the network
specification is terminated. This involves the termination of all process
instances executing in the logical network.

The removal activity of NETSLA can be applied to a process instance or
to a port group instance on a process instance. In the latter variation, no
future messages will be transmitted through the port instance which has been
removed. Attempts to transmit messages through a removed port will have no
effect. When the removal activity is applied to a process instance, the
process which has been removed may continue to execute and may generate future
messages; however, no future messages will be transmitted to the identified
process. (In effect, all 'in' ports on the process instance are removed.)
This somewhat unusual definition of 'removal' derives from two considerations:
process execution and the ALSTEN send operation. Because messages are
buffered using the ALSTEN send operation, processes may have meaningful work
to complete before their activity is terminated. Further, processes have an
inherent termination built into their descriptions.
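
The difference between removal and termination of a process instance can be made concrete with a small model. The sketch below is only one reading of the text: `ProcessInstance`, `remove` and `terminate` are hypothetical Python names, and two boolean flags stand in for whatever state the run-time support environment actually keeps.

```python
class ProcessInstance:
    """Toy process with an inbox (its 'in' ports) and a log of sent messages."""

    def __init__(self, name):
        self.name = name
        self.running = True         # False once the process is terminated
        self.accepts_input = True   # False once removed or terminated
        self.inbox = []
        self.sent = []

    def deliver(self, message):
        # Delivery to a removed or terminated process has no effect.
        if self.accepts_input:
            self.inbox.append(message)

    def emit(self, message):
        # A removed process may still execute and generate messages.
        if self.running:
            self.sent.append(message)


def remove(process):
    # Removal: in effect, all 'in' ports of the instance are removed.
    process.accepts_input = False


def terminate(process):
    # Termination: nothing further is sent to or received from the process.
    process.running = False
    process.accepts_input = False
```

A removed instance can therefore drain work already buffered in its inbox and keep sending, whereas a terminated instance falls silent in both directions.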

The connection activity involves two port instances and once performed
ensures that all messages transmitted through the first port instance will be
transmitted through the second until this connection is broken by a subsequent
disconnection activity. This activity was illustrated in the simple network
specification shown in Figure 8.

The  disconnection  activity  applies  to  ports  and  has  two  variations.  In 
the  first,  two  (presumably  connected)  port  instances  are  identified.  This 
variation  breaks  a  previously  established  connection  between  the  identified 
ports.  In  the  second  variation,  only  one  port  instance  is  identified.  This 
variation  breaks  all  previously  established  connections  involving  the 
identified  port.  Once  a  connection  between  two  port  instances  has  been 
broken,  future  messages  transmitted  through  the  first  port  will  no  longer  see 
the  connection  and  hence,  will  not  automatically  be  transmitted  through  the 
second  port. 
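
The connection activity and the two variations of disconnection can be modelled as operations on a set of port pairs. The following Python sketch is hypothetical throughout: `ConnectionTable` and its methods are invented names, ports are plain strings such as "proc1.output", and the two-port form of `disconnect` is assumed to ignore the order in which the ports are given.

```python
class ConnectionTable:
    """Toy record of the connections maintained by a network specification."""

    def __init__(self):
        self.links = set()   # (source port, destination port) pairs

    def connect(self, src, dst):
        self.links.add((src, dst))

    def disconnect(self, port, other=None):
        if other is None:
            # One-port variation: break every connection involving `port`.
            self.links = {link for link in self.links if port not in link}
        else:
            # Two-port variation: break only the connection between the pair.
            self.links.discard((port, other))
            self.links.discard((other, port))

    def destinations(self, src):
        # Ports that would receive a copy of a message sent through `src`.
        return sorted(dst for s, dst in self.links if s == src)
```

With the four connections of Figure 8 loaded, `destinations("proc3.output")` lists both input ports, which corresponds to the one-to-many replication described earlier.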

Message  transmission  involves  the  transmission  of  a  message  value 
through  a  port  instance  and  has  the  same  semantics  as  the  send  operation  of 
ALSTEN. 

NETSLA does not provide general variable declaration or assignment
mechanisms as does Pascal (and ALSTEN). Instead, NETSLA is based on a dynamic
binding of identifiers to values in event clause headers, creation activities
or structured activities (discussed in the next section). For example, the
value of the message being transmitted can be bound to an identifier in an
'arrive clause header'. While this is sufficient for most purposes,
occasionally there arises a need to construct values of (Pascal) structured
types. The value construction activity of NETSLA has been introduced to fill
this need. Value construction involves the assignment of values to the
components of a structured type (the type of the value being constructed is
given by 'type name') and the binding of an identifier to the value
constructed. The identifier can then be used in later activities to refer to
the value constructed.

Periodically, the attributes of a process instance will need to be
updated to reflect changes in the characteristics of the process instance. The
attribute assignment activity is provided to enable the updating of the
attributes of a process instance. An attribute of a process instance is
denoted by a conjunction of a 'process instance' and an attribute name. The
type of the 'value' assigned to an attribute of a process instance must be
compatible with the type of the attribute.

Like process descriptions, network specifications that act as "composite
processes" may need to announce events to their overseeing network
specification while they are active. This capability is provided by the event
announcement activity.

C.2.2.4  Structured  Activities 

NETSLA  provides  structured  activities  for  alternation,  iteration  and 
location.  The  syntax  for  the  alternation  activity  is  presented  in  Figure  12. 
This  activity  is  derived  from  the  case  statement  of  Pascal  and  provides  a 
mechanism  for  specifying  alternative  lists  of  activities  to  be  performed  on 
the  basis  of  an  available  value. 

<structured activity> ::= <alternation> | <iteration>
                        | <location>

<alternation> ::= case <value> of
                    {<case list element>} [<otherwise part>] end case

<case list element> ::=
                    <case label list> ':' '(' <activity list> ')'

<otherwise part> ::= otherwise <activity list>

Figure  12.  Alternation  in  NETSLA 

The syntax of the location and iteration activities is presented in
Figure 13. These activities provide a mechanism for selecting process and
port instances in the logical network maintained by a network specification.
Both activities are based on a 'selection binding' which specifies the
criteria to be used in selecting groups of object (port and process)
instances. The 'selection binding' is also used to bind names to the objects
selected.

The  iteration  activity  is  a  looping  construct.  The  activity  list 
specified  in  the  iteration  activity  is  performed  for  each  group  of  objects 
that  meet  the  criteria  of  the  'selection  binding'.  In  each  iteration  of  the 
activity  list,  a  new  group  of  object  instances  is  selected  and  bound  to  the 
names  specified  in  the  'selection  binding'.  The  location  activity  is  a  simple 
conditional  construct.  The  activity  list  specified  in  the  location  activity 
will  be  performed  at  most  one  time  for  one  group  of  objects  that  meet  the 
criteria  of  the  'selection  binding'.  In  the  case  of  location,  if  multiple 
groups  of  object  instances  meet  the  criteria,  one  of  these  groups  is  selected 
arbitrarily  and  the  object  instances  in  this  group  are  bound  to  the  specified 
names.  In  the  case  of  iteration,  the  activity  list  specified  will  be  applied 
to  all  groups,  but  the  order  of  application  is  arbitrary.  If  no  group  of 
object  instances  meets  the  criteria  of  the  'selection  binding',  in  either 
iteration or location, the activity list specified in the optional 'else part'
will be performed.

<iteration> ::= <iteration header> <activity list>
                [<else part>] end range

<iteration header> ::= range <selection binding> do

<location> ::= <location header> <activity list>
               [<else part>] end find

<location header> ::= find <selection binding> do

<else part> ::= else <activity list>

<selection binding> ::= <simple selection binding>
                      | <nested selection binding>

<simple selection binding> ::=
                <port binding> [<where clause>]
              | <process binding> [<where clause>]

<nested selection binding> ::= <port binding>
                <process binding> [<where clause>]

<where clause> ::= where <criteria>

<criteria> ::= <criteria factor>
             | <criteria> or <criteria factor>

<criteria factor> ::= <criteria primary>
                    | <criteria factor> and <criteria primary>

<criteria primary> ::= not <criteria primary>
                     | <connectivity criteria> | <attribute criteria>

<connectivity criteria> ::= connected <port instance>
                            [to <port instance>]

<attribute criteria> ::=
                <attribute denoter> <rel op> <value>

Figure 13. Iteration and Location in NETSLA

The 'selection binding' can be a 'simple selection binding' or a 'nested
selection binding'. A 'simple selection binding' is used to select a single
object instance (one for each iteration in the case of the iteration
activity), while a 'nested selection binding' identifies a process instance
and a port instance on the identified process.

The port and process bindings in the simple and nested selection
bindings identify the process class and/or port set of interest. The optional
'where clause' is used to impose additional selection criteria based on
connectivity or attribute values.
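
The difference between the iteration (range) and location (find) activities described above can be sketched in Python. The function names `find` and `range_over`, the use of plain predicates for the 'where clause', and the restriction to single objects rather than groups are all assumptions made for this illustration; NETSLA itself defines only the syntax shown in Figure 13.

```python
import random


def select(objects, criteria):
    """Return every object that satisfies the 'where clause' predicate."""
    return [obj for obj in objects if criteria(obj)]


def find(objects, criteria, body, else_part=None):
    """Location: run `body` at most once, on one arbitrarily chosen match."""
    matches = select(objects, criteria)
    if matches:
        body(random.choice(matches))
    elif else_part is not None:
        else_part()


def range_over(objects, criteria, body, else_part=None):
    """Iteration: run `body` once per match, in arbitrary order."""
    matches = select(objects, criteria)
    random.shuffle(matches)
    for obj in matches:
        body(obj)
    if not matches and else_part is not None:
        else_part()
```

In both functions the 'else part' runs only when no object meets the criteria, which is the behavior the text gives for an empty selection.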

C.2.2.5  A  Simple  Mail  System 

This section presents the design of a simple mail system to illustrate
the basic features of NETSLA. The mail system provides services that allow
users to create numbered mailboxes, read the mail in a numbered mailbox and
send letters to a numbered mailbox. The type definitions needed in the design
of the simple mail system are presented in Figure 14.

 1  network simple_mail;
 2    port set user
        (req in user_req, rsp out user_rsp);
 3    process class mailbox
 4      attributes
 5        number : integer
 6      end attributes
 7      port input in letter;
 8      port output out user_rsp;
 9      port control in signal;
10      event mailbox_empty;
11    end mailbox
12    arrive msg : user_req on req of u : user do
13      case msg.kind of
14        make_box :
15          (find box : mailbox where
                box.number = msg.number do
16           else
17             create new_box : mailbox;
18             new_box.number := msg.number
19           end find)
20        read_mail :
21          (find box : mailbox where
                box.number = msg.number do
22             connect box.output to u.rsp;
23             send to box.control
24           else
25             construct rsp : user_rsp [kind := empty];
26             send rsp to u.rsp
27           end find)
28        send_mail :
29          (find box : mailbox where
                box.number = msg.number do
30             send msg.let to box.input
31           end find)
32    end arrive
33    when mailbox_empty announced by
          box : mailbox do
34      disconnect box.output
35    end when
36  end simple_mail

Figure 16. Network Specification for the Simple Mail System

A network specification which implements the simple mail system is shown
in Figure 16. To match the external specification illustrated in Figure 15,
this network defines a port set 'user' (line 2). The dynamic behavior of the
mail system is specified in the arrive (lines 12-32) and when (lines 33-35)
clauses. New mailboxes are created as requested when there are no mailboxes
with the specified number (lines 15-19). Reading the contents of a numbered
mailbox involves locating a mailbox instance with the correct 'number'
attribute. If an acceptable mailbox instance is found, its output port is
connected to the 'rsp' subport of the user port that generated the request and
a signal is delivered to the 'control' port of the mailbox instance (lines
21-23). This connection is broken when the mailbox instance announces the event
'mailbox_empty' (lines 33-35). If no mailbox instance can be found, an empty
response is constructed and transmitted to the response subport of the user
port that generated the request (lines 25 and 26).
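
The behavior of the arrive clause can be imitated by a small, single-process Python sketch. Everything here is an assumption made for illustration: messages are dictionaries, a response subport is a Python list, and the mailbox process (whose description is not shown in this section) is assumed to drain its letters to the connected reader and then announce 'mailbox_empty'.

```python
class Mailbox:
    def __init__(self, number):
        self.number = number
        self.letters = []      # letters received on the 'input' port
        self.reader = None     # list standing in for a connected 'rsp' subport


class SimpleMail:
    def __init__(self):
        self.boxes = []

    def _find(self, number):
        # 'find box : mailbox where box.number = msg.number'
        return next((b for b in self.boxes if b.number == number), None)

    def arrive(self, msg, rsp):
        box = self._find(msg["number"])
        if msg["kind"] == "make_box":
            if box is None:                 # else part: create a new mailbox
                self.boxes.append(Mailbox(msg["number"]))
        elif msg["kind"] == "read_mail":
            if box is None:                 # else part: send an empty response
                rsp.append({"kind": "empty"})
            else:
                box.reader = rsp            # connect box.output to u.rsp
                rsp.extend(box.letters)     # assumed mailbox behavior: drain
                box.letters.clear()
                box.reader = None           # disconnect on 'mailbox_empty'
        elif msg["kind"] == "send_mail":
            if box is not None:
                box.letters.append(msg["let"])
```

As in Figure 16, a 'make_box' request for an existing number does nothing, and a 'send_mail' request for a missing mailbox is silently dropped because that find has no else part.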

C.2.2.6  Event Clause Execution

To  achieve  decentralized  execution  of  network  specifications,  the 
activities  specified  in  an  event  handling  clause  will  be  performed — 
indirectly — by  any  process  that  announces  the  event  handled  by  the  clause. 
For  example,  any  process  that  sends  a  message  to  the  simple  mail  system  shown 
in  Figure  16  would  perform  the  activities  specified  in  the  arrive  clause 
(lines  12-32)  of  the  network  specification. 

The  activities  specified  in  an  event  handling  clause  are  best  viewed  as 
specifying  searches  and  modifications  of  a  partitioned  and  distributed 
representation  of  a  logical  communication  network.  This  representation 
contains  representations  of  all  object  (port  and  process)  instances  in  the 
logical  network  as  well  as  representations  of  the  current  connections  between 
port  instances  in  the  logical  network.  Executions  of  all  event  handling 
clauses  are  required  to  be  serializable. 
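
One simple way to picture these two requirements (handlers run by the announcing process, and serializable executions) is a single lock around handler execution. The sketch below is a deliberately centralized stand-in: the `NetworkSpec` class and its methods are hypothetical, and a real implementation would rely on a distributed mechanism rather than one lock.

```python
import threading


class NetworkSpec:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the serialization mechanism
        self.handlers = {}
        self.state = {}                 # the logical network representation

    def on(self, event, handler):
        self.handlers[event] = handler

    def announce(self, event, *args):
        # The handler runs in the announcing thread, one handler at a time,
        # so every execution order is equivalent to some serial order.
        with self._lock:
            return self.handlers[event](self.state, *args)
```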

C.3  DISCUSSION

The  language  features  presented  reflect  a  concentration  on  inter-process 
relationships  that  describe  program  structure.  Recall  that  our  goals  were  to 
provide  features  which  would  support  independence  of  processes  and  the 
description  of  process  hierarchies,  while  obtaining  information  which  would 
aid  in  the  effective  execution  of  programs.  The  network  specifications  of 
PRONET  are,  in  general,  more  useful  in  support  of  the  first  goal  than  of  the 
second.  As  PRONET  has  developed  and  the  features  in  NETSLA  have  come  to 
provide  more  power  for  describing  dynamic  reconfigurations,  the  network 
specifications  have  come  to  provide  less  useful  information  to  an  NOS.  For 
programs which can be described by a static network, however, the features of
PRONET effectively support both goals.

PRONET also includes features which provide a programmer with the
ability to handle network failures. Programming for robustness in the face of
such failures requires a considerable alteration of programming style, but it
can be done within the framework provided by PRONET. Further discussion of
these features can be found in [Macc82].

APPENDIX D

FAILURE HANDLING IN PRONET

Richard J. LeBlanc
Arthur B. Maccabe

D.1  INTRODUCTION 

New features aiding the design and description of distributed programs
are central to the design of PRONET [Macc82, LeBl82]. These new capabilities
are being implemented as extensions to Pascal, but since they involve only
interprocess communication and interconnection of processes via message
channels, they could be added to many other languages.

Among  the  important  features  of  PRONET  are  the  abstraction  capabilities 
it provides for the specification of programs as logical networks of
processes. Network specification and process description are separated in PRONET by
the  division  of  the  language  facilities  into  two  sublanguages:  NETSLA 
(Network  Specification  Language)  and  ALSTEN  (an  extension  of  Pascal  for 
process  description).  These  capabilities  allow  an  encapsulated  description  of 
the  connections  between  processes,  aiding  in  the  understanding  of  complex 
programs  and  providing  information  a  distributed  operating  system  needs  for 
making  placement  and  scheduling  decisions. 

Other programming languages/systems that support a similar separation
(UNIX [Bour78], Mesa [Mitc79], Task Forces [Jone79] and PCL [Less79]) do not
enforce a complete separation. In each of these languages/systems a process
may specify the creation of new processes in its description. Thus, while an
abstract view of the communication environment is available, neither the
operating system nor a person reasoning about the program may rely on the
completeness of this view. In PRONET, the conditions and activities associated
with any structural modification of a communication environment (including
process creation) must be stated in a network specification.

A network specification describes the initial configuration of a
distributed program, in terms of processes to be created and the communication
connections among them, and it describes the evolution of the network of
processes in response to events. The event handling capabilities of network
specifications are the key to providing for a centralized expression of
process interactions. Message transmissions are events which may be handled
in a specification; a process may also explicitly announce an event in order
to suggest action by an event handler in its overseeing network specification.
Network specifications rely on a distributed data management system to
maintain information about resource availability and, hence, the activities
expressed in a network specification can be performed in a decentralized
fashion. The distributed data manager enforces serialization of the execution
of event handlers, so network specifications need not be implemented as
processes. Thus an event handler can be executed as part of the process which
caused its invocation and the overall structure of a distributed program can
be thought of as a tree of specifications and processes, with processes only
appearing at the leaves. As a result of this structure, there is no single
critical point whose failure can halt the operation of an entire program.


The failure handling features of PRONET are intended to provide a
capacity for continued execution in the presence of mechanical failures and
the possibility of recovering portions of a program that may have been
affected by such a failure. An additional goal was that the failure handling
features should only impact execution costs to the extent that they are used
in a program. In order to accomplish these objectives, PRONET uses the
concepts of permanent processes and stable storage. The features available
support buffered communication (rather than remote procedure call) in an
unreliable environment and make it possible for a programmer to ensure that
the external behavior of a process is consistent with its internal state, even
in the presence of failures.


D.2  DEFINITIONS OF FAILURE

The following definitions of failure, error and fault are presented by
Randell, Lee and Treleaven [Rand78]:

"When the behavior [of a system] deviates from that which is
specified for it, this is called a failure. A failure is thus an
event... We term an internal state of a system an erroneous state
when there exist circumstances (within the specification of the
use of the system) in which further processing, by the normal
algorithms of the system, will lead to a failure which we do not
attribute to a subsequent fault. ...The term error is used to
designate that part of the state which is incorrect. ...A fault
is the mechanical or algorithmic cause of an error."


Clearly, faults, and hence failures, can be encountered in any programming
environment during the execution of any application.

The failure handling features of PRONET are based on a separation
between algorithmic and mechanical failures and an assumed ability to detect
and classify all occurrences of "failures". Considering the general
definition of failure presented above, an ability to detect and classify all
occurrences of failures is clearly infeasible. Hence, the failure handling
features of PRONET are based on a limited view of failure. In this limited
view, a mechanical failure occurs when a hardware component (processor, storage
device or communication link) has failed to perform in accordance with its
specified behavior. An algorithmic failure occurs when an executing process
performs a primitive operation with an invalid operand (e.g., integer
division with a zero-valued divisor or pointer dereference with a nil-valued
pointer).

The distinction between algorithmic and mechanical failures is
introduced to capture differences in the durations and causes of failures.
Algorithmic failures are presumed to be permanent and a result of faulty
programming. Hence, detected algorithmic failures lead directly to the
termination of processes in which they occur. Mechanical failures, on the
other hand, are expected to be transient and a result of a fault in the
underlying programming environment (i.e., a processor crash or a communication
link failure). Mechanical failures do not lead to the termination of long-lived
processes but may temporarily limit their availability.


D.3  BUFFERED COMMUNICATION AND FAILURES

In  a  perfect  programming  environment,  the  send  operation  of  buffered 
communication  might  be  viewed  as  passing  the  responsibility  for  processing 
messages  to  receiving  processes.  In  this  way,  processes  that  declare  input 
ports  accept  the  responsibility  for  correctly  processing  all  messages  that  are 
sent  to  these  ports.  Processes  that  send  messages  can  rely  on  the  specified 
behavior  of  receiving  processes  to  ensure  that  their  messages  are  handled 
correctly  and  completely. 


Clearly  this  interpretation  of  buffered  communication  is  inappropriate 
when  processes  can  encounter  failures  during  their  executions.  The  initial 
extensions  developed  for  using  CLU  [Lisk77]  in  distributed  programming 
environments [Lisk79] were based on buffered communication primitives. More
recently [Lisk80], a remote procedure call (RPC) primitive has been adopted:

"RPC is a very high level primitive... For some time we were
hopeful that there might be an intermediate level primitive that
would solve many of the user's problems, and would not be as
expensive as RPC. Our experience indicates that there is no such
primitive, we have looked for one and have not found it."


Much of the rationale for selecting RPC which is presented in [Lisk80, Lisk82]
is based on an inability to resolve the semantics of intermediate level
primitives (e.g., buffered communication) with potential failures and the
intended area of application of the language features being developed.


PRONET is based on an interpretation of buffered communication which is
sufficiently powerful to aid programmers in their task of describing
interprocess communication, yet weak enough to allow for the possibility of
failure. The send operation of PRONET completes successfully when the message
being sent has been correctly copied into the address space of all receiving
processes and all events which are generated by the send have been handled.
Further, the send operation is atomic with respect to failures: either all
events associated with the message transmission are handled completely or none
of these events is handled (and a failure indication is returned). Hence,
after successfully completing a send operation, the sending process can assume
that receiving processes will handle the message in an appropriate fashion.
Receiving processes, on the other hand, accept the responsibility (in
conjunction with the network specification that oversees their operation) for
handling all messages that are available on their input ports.
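
The all-or-nothing character of this send operation can be sketched in Python. The function `atomic_send`, the use of lists as receiver queues and the dictionary of network state are assumptions made for the example; the sketch also assumes that the final commit step itself cannot fail, which a real distributed implementation would have to guarantee by other means.

```python
import copy


def atomic_send(message, receivers, handlers, state):
    """Deliver `message` to every receiver queue and run every event handler,
    or do neither. Returns True on success and False (the failure
    indication) otherwise."""
    # Work on private copies so that a failure leaves the originals untouched.
    new_queues = [list(q) for q in receivers]
    new_state = copy.deepcopy(state)
    try:
        for queue in new_queues:
            queue.append(copy.deepcopy(message))
        for handler in handlers:
            handler(new_state, message)
    except Exception:
        return False
    # Commit: publish the copies only after every step has succeeded.
    for queue, new_queue in zip(receivers, new_queues):
        queue[:] = new_queue
    state.clear()
    state.update(new_state)
    return True
```

A sender that receives `False` knows that no receiver saw the message and no event was handled, so it is free to retry or to take another action.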


The crucial distinction between this interpretation and the
interpretation presented earlier involves the substitution of the words
"handling" and "appropriate" for the words "processing" and "correct"
respectively. In some applications, under certain circumstances, appropriate
handling of a message may involve ignoring the message entirely. Because of
this necessarily weak interpretation of buffered communication, sender
processes that need to know how their messages were (or will be) handled will
need an alternative means of obtaining this information. For most
applications, a simple response port will suffice. Clearly this complicates
the description of such processes but processes that do not require this
information will not incur additional costs.

D.4  FAILURE HANDLING

An  important  motivation  for  introducing  failure  handling  facilities  into 
the  design  of  PRONET  was  based  on  the  need  to  describe  long-lived  objects. 
PRONET  does  not  provide  an  inherent  distinction  between  long-lived  and 
transient  objects — all  objects  are  processes.  However,  it  is  necessary  to 
distinguish  between  the  processes  in  a  logical  network  that  are  capable  of 
surviving  mechanical  failures  and  those  whose  activities  are  aborted  when  they 
are  in  the  scope  of  a  mechanical  failure.  The  activity  permanent  can  be 
applied to processes and provides a capacity to survive mechanical failures.

"Permanence" is an inherited property. When a "non-permanent" network
applies the permanent activity to one of its processes, this activity has no
immediate effect. Whenever a logical network becomes "permanent", all
processes in the network to which the permanent activity has been applied will
also become permanent.

If any process in a "non-permanent" network encounters a failure
(algorithmic or mechanical) during its execution, the entire network fails and
all processes in the network are terminated. In this way, failures
encountered by processes are propagated to their overseeing network.
Propagation of a failure continues until a "permanent" network (or process in
the case of mechanical failures) is encountered. Failures encountered by
processes executing in a "permanent" network do not directly affect other
processes executing in this network.
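
The propagation rule can be sketched as a walk up the tree of specifications and processes. The `Node` class and the `fail` function are hypothetical; for brevity the `permanent` flag stands for effective permanence, so the inheritance rule of the previous paragraph is assumed to have been applied already.

```python
class Node:
    """A process (no children) or a network specification (with children)."""

    def __init__(self, name, permanent=False, children=()):
        self.name = name
        self.permanent = permanent
        self.children = list(children)
        self.parent = None
        self.alive = True
        for child in self.children:
            child.parent = self

    def terminate(self):
        self.alive = False
        for child in self.children:
            child.terminate()


def fail(process, mechanical):
    """Propagate a failure upward; return the node that stops the propagation."""
    if mechanical and process.permanent:
        return process              # a permanent process survives for recovery
    node = process
    node.terminate()
    while node.parent is not None and not node.parent.permanent:
        node = node.parent
        node.terminate()            # a non-permanent network fails as a whole
    return node.parent              # the permanent network, or None at the root
```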

'Permanent'  processes  can  be  explicitly  terminated  or  removed  (by  their 
overseeing  network  specification)  and  can  express  their  own  termination  but 
will  be  recovered  (as  described  earlier)  if  they  have  not  terminated  and  are 
in  the  scope  of  a  mechanical  failure. 

Because mechanical failures can alter the internal state of any
executing process, processes in the scope of a mechanical failure cannot rely
on information stored in their internal state after a mechanical failure has
occurred. A stable storage facility has been integrated into ALSTEN to enable
the description of processes that must rely on portions of their internal
state when mechanical failures are recovered. Like the facility proposed in
[Lisk79], process descriptions interface to stable storage by declaring stable
variables and periodically "checkpointing" the values of these variables.
When a mechanical failure is detected, all processes in the scope of the
failure are halted before they can begin a new checkpoint operation. When the
mechanical failure is recovered, each permanent process halted by the failure
is restored using values saved by checkpointing and non-permanent processes
are removed from the logical network.

The  features  presented  thus  far  are  useful  for  describing  long-lived 
processes  but  do  not  enable  receiving  processes  to  ensure  that  all  messages 
which  have  been  successfully  delivered  to  their  IPC  space  are  handled  in  an 
appropriate  fashion  when  a  mechanical  failure  occurs.  An  important  problem  is 
that  messages  will  be  inserted  into  the  IPC  space  of  a  process  asynchronously 
and  hence,  a  process  cannot  use  inline  checkpointing  operations  to  ensure  that 
all  messages  which  have  been  delivered  will  survive  mechanical  failures.  For 
this  reason,  ALSTEN  provides  stable  ports.  Any  input  port  declared  by  a 
process  may  be  declared  with  the  attribute  stable.  All  messages  which  have 
been  successfully  delivered  to  a  stable  input  port  and  not  removed  by  the 
receiving  process  during  its  execution  will  be  available  after  a  mechanical 
failure.  Messages  are  only  removed  from  a  stable  input  port  when  the  process 
performs  a  checkpoint  operation. 
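
The interplay of stable variables, checkpoints and stable ports can be sketched with a toy Python class. The class `StableProcess`, its method names and the in-memory fields standing in for stable storage are assumptions made for this illustration and do not reflect ALSTEN syntax.

```python
import copy


class StableProcess:
    """Toy process with stable variables and one stable input port."""

    def __init__(self):
        self.stable_vars = {}      # volatile working copy of the stable variables
        self.port = []             # messages delivered and not yet removed
        self._removed = 0          # messages consumed since the last checkpoint
        self._saved_vars = {}      # contents of stable storage
        self._saved_port = []

    def deliver(self, msg):
        # Delivery is asynchronous; here it writes through to stable storage.
        self.port.append(msg)
        self._saved_port.append(msg)

    def receive(self):
        self._removed += 1
        return self.port.pop(0)

    def checkpoint(self):
        self._saved_vars = copy.deepcopy(self.stable_vars)
        # Only now do consumed messages leave the stable port.
        del self._saved_port[:self._removed]
        self._removed = 0

    def recover(self):
        """Restore the state saved by the last checkpoint."""
        self.stable_vars = copy.deepcopy(self._saved_vars)
        self.port = list(self._saved_port)
        self._removed = 0
```

A message consumed after the last checkpoint reappears on the port after `recover`, so the restored variables and the restored port contents always describe the same point in the execution.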

D.5  PERMANENCE AND EXTERNALLY VISIBLE BEHAVIOR

The use of checkpoints, stable variables and recovery descriptions is
sufficient to describe a consistent recovery from mechanical failures, but does
not enable the programmer to ensure that the recovered state is consistent
with the externally visible behavior of the process. In [Lisk80] it is argued
that many applications will need a capacity to incorporate 'permanence of
effect' in their communications. Using buffered communication, this property
would allow receiving processes to rely on the information contained in the
messages they receive. Hence, ALSTEN provides a checkpointing send operation
which combines the checkpoint and send operations into a single operation that
is atomic with respect to mechanical failures.

D.6  PARTITIONING FAILURES

The  failure  handling  features  described  thus  far  are  primarily  aimed  at 
handling  point  failures  (the  failure  of  a  single  process).  A  reasonable 
implementation  of  PRONET  would  be  based  on  a  partitioned  and  decentralized 
network representation. As such, mechanical failures could cause portions of
the network representation to be unavailable for use. Thus portions of the
network  representation  may  not  be  visible  during  the  execution  of  an  event 
handling  clause. 

Modifications performed by an event clause execution may implicitly
affect objects in inaccessible portions of the logical network representation,
even though the objects explicitly modified by the event clause were available
for use. Consider that port 'p1' is connected to port 'p2' and that p2 is
available but that p1 is inaccessible. In this situation the activity
"disconnect p2" can be performed but will affect p1, as p1 must see the
disconnection.

When  a  mechanical  failure  that  has  caused  a  partitioning  failure  is 
recovered,  portions  of  the  logical  network  representation  will  need  to  be 
updated  (merged)  to  reflect  modifications  performed  in  other  partitions.  In 
order  to  perform  this  merging  of  visibility  partitions,  redundant  information 
must be stored in the logical network representation. In general this
redundant information will be stored in the form of back-pointers which can also be
used  for  efficient  traversal  of  the  logical  network  representation. 
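
The p1/p2 example and the role of back-pointers in merging can be sketched as follows. The dictionary of peer sets, the function names and the rule "keep a connection only if both ends still record it" are assumptions made for this illustration; in particular the sketch assumes that only disconnections occur while the network is partitioned.

```python
def disconnect(links, port, accessible):
    """Break all connections of `port`, updating only the accessible partition."""
    for peer in list(links[port]):
        links[port].discard(peer)
        if peer in accessible:
            links[peer].discard(port)   # otherwise the peer's entry goes stale


def merge(links):
    """Reconcile after recovery using the redundant (back-pointer) entries."""
    for port, peers in links.items():
        for peer in list(peers):
            if port not in links[peer]:
                peers.discard(peer)     # the other end recorded a disconnection
```

After `disconnect(links, "p2", ...)` with p1 inaccessible, p1 still lists p2 as a peer; `merge` notices that p2 no longer points back and removes the stale entry, so p1 finally sees the disconnection.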

D.7  SUMMARY 

The important concepts developed in PRONET are based on the separation
of connectivity specifications from process descriptions. This separation
allows process descriptions to be independent of one another, since they can
only describe interactions with the other components of a program through
messages sent to locally declared ports and by announcing events. Thus a
programmer concentrates on the logical structure of a program and need not be
concerned with such things as physical distribution considerations. The
hierarchical structure of a PRONET program, consisting of processes and a tree
of overseeing network specifications, is particularly well-suited as a
description of a distributed program. Important features of PRONET allow
continued execution of unaffected parts of a program in the presence of
failure and recovery of failed processes through use of checkpointing and
stable ports. Finally, PRONET includes an intermediate level communication
approach, buffered communication, which operates meaningfully in the presence
of failures. It will thus allow the exploration of the appropriateness of
communication protocols other than remote procedure call for the
implementation of realistic distributed programs.

An initial implementation of PRONET on a single processor has been
completed. Current plans are for a more complete implementation to be developed
to run on a network of Perq workstations. As part of the Clouds operating
system project [Mcke82] a real-time distributed data management system is
being designed [Allo82] which should greatly simplify the implementation of
PRONET and improve its performance.

APPENDIX E

SOFTWARE FAULT TOLERANCE:

OVERVIEW OF THE RECOVERY BLOCK SCHEME

Tom  Wilkes 

E.1  INTRODUCTION 

Ever since the first computing systems were designed and built, the
problem of the reliability of these systems in the face of faults and errors
has been a concern of designers and researchers. However, until approximately
the last decade, most work on system reliability has been focused on the area
of hardware reliability, even though any non-trivial software system is more
complex by several orders of magnitude than the machine on which it runs. As
Randell notes ([Rand75]), a simulator for a certain machine written at the
level of detail required by the hardware designers is in general many times
smaller than the operating system for that machine. Since the number of
possible internal states of any but the most trivial software far outnumbers
the number of possible states of the hardware on which it runs, the possibility
of design error in the software is correspondingly greater. Hence, the need for
methods of recovery from design flaws in software is at least as pressing as
that for hardware.

Also in [Rand75], Randell states:

"If all design inadequacies could be avoided or removed this would
suffice to achieve software reliability... Indeed many writers
equate the terms "software reliability" and "program correctness".
However, until reliable correctness proofs (relative to some
correct and adequately detailed specification), which cover even
implementation details, can be given for systems of a realistic
size, the only alternative means of increasing software
reliability is to incorporate provisions for software fault
tolerance."

As Svobodova has noted ([Svob79]), distributed systems have an even greater
potential for providing reliability than their non-distributed counterparts:

"Distributed systems are often claimed to be inherently more
reliable than systems based on a large central processor. That
is, given that a distributed system is properly designed, it
offers better reliability. First, distributed systems by their
very nature provide opportunities for redundancy. Second, error
propagation is restricted by physical separation of processes and
resources. And finally, individual nodes in the distributed
system may be less complex than a large central processor and, as
a result, ought to have lower probability of failures. Basically,
distributed systems have a potential for being more reliable than
systems based on a large central processor. However, this
potential needs to be exploited through proper design."

PRONET, a language for distributed processing applications which has been
under development at Georgia Tech ([Maco82]), incorporates extensive facilities
for dealing with the problem of hardware failures. However, the work to date
on the design of PRONET does not treat the problem of software (algorithmic)
failures. Algorithmic failures present a much more difficult problem than
hardware failures. Because such failures presumably result from a logical
fault in the program, use of checkpointing and restarting will only result in
a reproduction of the failure. (In the case where a hardware failure corrupted
data and thus caused the algorithmic failure, such techniques may provide
a means of recovery.) Thus some capability to execute alternative code is
required, as well as some capability to undo the effects of the code which has
failed. The addition of these capabilities to a distributed system will
increase the complexity of programming in the system, since processes may
interact in the recovery mode and during the "undo" process, as well as during
their normal execution. As Shrivastava and Banatre have noted ([Shri78]),

"...appropriate programming language tools must be provided to
cope with this additional complexity in a systematic manner,
otherwise resulting programs are likely to be even less reliable
than versions with no redundancy."


It  is  in  support  of  the  design  of  such  tools  that  the  present  survey  is  being 
undertaken. 


In an excellent review article ([Rand78]), Randell, Lee, and Treleaven
have surveyed the issues of hardware and software reliability, and have
catalogued current techniques for error recovery and fault tolerance. A
repetition of their work will not be attempted here. Rather, the results of
their survey will be summarized, and two important techniques for software
fault tolerance (the so-called forward and backward error recovery methods)
will be briefly contrasted. However, the bulk of the discussion will
center on a particular backward error recovery scheme, the recovery block
method, which is discussed in [Rand75], [Rand78], [Ande81], and many other
publications which have issued from the software fault tolerance project at
the University of Newcastle upon Tyne (most of which will be discussed below).
Recent publications which consider the application of these recovery
techniques to distributed computing systems will also be discussed.

E.2 SOME TERMINOLOGY

Definitions for many terms used in the discussion of software fault
tolerance are provided in [Rand78]; they have been adopted by subsequent
papers from the project at Newcastle upon Tyne and also by several
other authors in the field. For convenience, some of these definitions are
reproduced here:

"The reliability of a system is taken to be a measure of the success
with which the system conforms to some authoritative
specification of its behavior...

When the behavior of a system deviates from that which is
specified for it, this is called a failure...

We term an internal state of a system an erroneous state when the
state is such that there exist circumstances (within the
specification of the use of the system) in which further processing,
by the normal algorithms of the system, will lead to a
failure which we do not attribute to a subsequent fault... The
term "error" is used to designate that part of the state which is
"incorrect"...

A fault is the mechanical or algorithmic cause of an error, while
a potential fault is a mechanical or algorithmic construction
within a system such that (under some circumstances within the
specification of the system) the construction will cause the
system to assume an erroneous state..."

Note that, under the definitions of "fault" and "error" given above, the
methods of repairing faults and errors in a system are very different. In
particular, the repair of a fault in a software component is a complex task
which would be very difficult to automate, and which should be accomplished by
manual means in an unharried manner. Repair of an error, on the other hand,
entails the change of the erroneous state into one in which processing may
continue correctly (within the specification of the system), or the
restoration of a previously-existing state which satisfies these
specifications. We shall see that this process is automatable. Thus, repair of
an error is required for continued operation of a system, whereas the repair of
the fault which caused the error is not always necessary for ensuring continued
operation.

E.3 METHODS OF SOFTWARE FAULT TOLERANCE

As has been mentioned above, [Rand78] provides a comprehensive survey of
techniques for hardware and software fault tolerance. The authors consider
strategies for error detection, fault treatment, damage assessment, and error
recovery as comprising a classification of fault-tolerance techniques. These
strategies are by no means mutually exclusive, as we shall see.


E.3.1 Error Detection

As  defined  in  [Rand78], 

"the  purpose  of  error  detection  is  to  enable  system  failures  to  be 
prevented  by  recognizing  when  they  may  be  about  to  occur." 

To fulfill this purpose in the ideal case, however, the checks made
would have to be based solely on the system specification and be
independent of the actual implementation, to a degree probably not
realizable in practice. Also, the extent of error checking necessary would
probably fall victim to performance considerations. Thus, the complete
confidence afforded by the ideal case is generally not attainable, and some
"very high" level of confidence is all that can be expected. Nevertheless, all
strategies for fault tolerance depend on error checking for their invocation.

E.3.2  Fault  Treatment 

Error detection seeks only to identify the symptoms of a fault; it does
not try to identify the particular fault which caused the error. The
identification, location, and removal of a fault is a complex job, since many
errors may be caused by a particular fault, a particular error may be caused
by several different faults, the error caused by a particular fault may occur
only for certain input values, etc. Thus, the automation of the task of fault
removal in software is not feasible except in very simple cases. However, the
treatment of faults by alternative means, such as replacement strategies, is
more tractable; indeed, the recovery-block scheme which is discussed below is
such a strategy.

E.3.3 Damage Assessment

As  noted  in  [Rand78], 

"Damage assessment can be based entirely on a priori reasoning, or
can involve the system itself in activity intended to determine
the extent of the damage. Each approach can involve reliance on
the system structure to determine what the system might have done,
and hence possibly have done wrongly. The approach can be
explained, and might have been designed, by making explicit use of
atomic actions."

The intent here is that atomic actions provide a "sequence of delimitations
... of amounts of possible damage corresponding to each different error
detection point." Since, as the authors note, damage assessment is often
necessary to attempts at error recovery, and "is usually a rather uncertain and
incomplete affair", it is worthwhile to expend the effort involved in limiting
the spread of damage by such means.

E.3.4 Error Recovery

Methods for error recovery are divided into the so-called forward and
backward automatic recovery schemes. Forward recovery schemes attempt to make
further use of the erroneous state. Thus, predictions about the location and
consequences of software faults are necessary. Such a scheme must therefore
be designed as an integral part of the system for which it is to provide fault
tolerance. Also, the questions of damage assessment and fault treatment are
intermingled with the question of how to continue to provide service. Despite
the complexity which such a scheme adds to the system it is to serve, there
are situations in which knowledge of the system permits valid assumptions to
be made, and in which forward-recovery techniques then provide simple and
effective error recovery. In particular, these methods are very effective in
dealing with such situations as errors caused by invalid input data. The
exception-handling methods used in languages such as PL/I and Ada are examples
of forward-recovery methods.

Backward-recovery  schemes,  on  the  other  hand,  involve  restoration  of 
what  is  hoped  to  be  an  error-free  state,  and  thus  require  no  predictions  of 
the  location  or  nature  of  faults.  Rather,  backward  recovery  is  analogous  to 
mechanical  backups  in  hardware  systems.  Information  about  the  system  state 
previous  to  the  fault  is  restored  from  a  checkpoint,  and  a  back-up  process  is 
started. The back-up process is necessarily not the same as the failed
process, since a copy of the failed process would presumably only fail again.
In general, the back-up process (or processes) is simpler than the original
process, and may provide only a primitive simulation of the functions of the
original process (such as forwarding messages) in order to keep a program
running.

The recovery-block scheme, an example of a backwards-recovery scheme
which has been the object of detailed investigation at Newcastle upon Tyne, is
described in the next section.

E.4 THE RECOVERY-BLOCK SCHEME

The recovery-block scheme described by researchers at the University of
Newcastle upon Tyne ([Rand75], [Rand78], [Ande81]) is an example of a
backwards-recovery method. This method is a means of providing "gracefully
degrading software" ([Ande81]). The syntax for describing a recovery block
is:

    ensure <acceptance test> by
        <original block>
    else by
        <back-up block 1>
    else by
        ...
    else error;

where  some  of  the  "back-up  blocks"  may  be  simple  retries  of  previous  blocks. 

If  a  failure  occurs  in  the  original  block,  back-up  blocks  are  tried  until  one 
completes  without  failure  and  the  acceptance  test  is  satisfied,  or  else  an  error 
is  signalled.  The  back-up  blocks  may  have  to  undo  permanent  effects  made  by 
their  predecessors  before  doing  their  own  work. 
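
The control flow just described can be sketched in Python. This is only an
illustration under stated assumptions, not the Newcastle implementation: the
names recovery_block, RecoveryBlockError, faulty_double, and simple_double are
hypothetical, the system state is modelled as a single dictionary, and the
state is restored automatically from a full copy rather than by each back-up
block undoing its predecessor's effects.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when every alternate has failed (the 'else error' arm)."""


def recovery_block(state, acceptance_test, alternates):
    """Run alternates in order until one passes acceptance_test(state, prior)."""
    prior = copy.deepcopy(state)            # recovery point taken on entry
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state, prior):
                return state                # success: results are committed
        except Exception:
            pass                            # an exception counts as a failure
        state.clear()                       # backward recovery before the
        state.update(copy.deepcopy(prior))  # next alternate is tried
    raise RecoveryBlockError("all alternates failed")


def faulty_double(s):
    s["x"] = s["x"] * 3                     # deliberate design fault


def simple_double(s):
    s["x"] = s["x"] + s["x"]


result = recovery_block(
    {"x": 4},
    lambda s, prior: s["x"] == 2 * prior["x"],
    [faulty_double, simple_double],
)
print(result)  # {'x': 8}
```

Here the primary block fails its acceptance test, the state is rolled back,
and the back-up block produces the accepted result.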

E.4.1 Acceptance Tests

The function of the acceptance test is to ensure that the operation
performed on the system state by at least one of the alternate blocks is to the
satisfaction of the invoking program. Thus, an acceptance test need not be a
check on the "absolute correctness of an operation" ([Rand75]). In general, a
test is based on the present and prior values of variables global to the
alternate blocks and to the invoking procedure. Also, some means is provided
for checking whether global variables not accessed within the acceptance test
have been modified, thus giving a measure of security against unforeseen side
effects.

It is clear that the careful design of acceptance tests is important to
the success of the recovery-block method. However, strict requirements for
correctness must often yield to performance considerations, as in the
following example from [Rand75]:

    ensure sorted (S) and (sum (S) = sum (prior S))
    by quickersort (S)
    else by quicksort (S)
    else by bubblesort (S)
    else error;

Here, the strict requirement that the sorting algorithm yield a permutation of
its input values has been relaxed to a requirement that the sum of the input
values and the sum of the output values be the same.
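
What the relaxed test gives up can be seen in a small Python sketch. The
function names cheap_test and strict_test and the sample data are hypothetical;
the sketch shows only which results each test accepts, and says nothing about
their relative cost on the machines considered in [Rand75].

```python
from collections import Counter


def is_sorted(s):
    return all(a <= b for a, b in zip(s, s[1:]))


def cheap_test(s, prior):
    """The relaxed test of the example: sorted, and the sums agree."""
    return is_sorted(s) and sum(s) == sum(prior)


def strict_test(s, prior):
    """The strict requirement: sorted, and a permutation of the input."""
    return is_sorted(s) and Counter(s) == Counter(prior)


prior = [3, 1, 2]
good = [1, 2, 3]
bad = [2, 2, 2]   # sorted, same sum as the input, but not a permutation of it

print(cheap_test(good, prior), strict_test(good, prior))  # True True
print(cheap_test(bad, prior), strict_test(bad, prior))    # True False
```

The relaxed test therefore accepts some incorrect results; it trades
completeness of checking for a simpler test.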

E.4.2 The Recovery Cache

Before a back-up block may be tried, the state of global objects must be
restored to that existing before the failed block began execution. This state
restoration is made possible by the use of a recovery cache, in which the
values of global variables are stored prior to their first updates in the
current block. A recovery cache is essentially a differential file, and is
thus less costly in space than a full checkpoint. Since recovery blocks may
be nested, the cache is organized as a stack; state restoration for the
current recovery block requires restoration of global variable values from the
current top stack entry, and upon completion of a block, the stack entry is
discarded, thus "committing" the results of the block.
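
A minimal Python sketch of such a cache follows. The class RecoveryCache and
its method names are hypothetical, and global variables are assumed to live in
one dictionary and to exist before they are assigned. One detail goes beyond
the description above and is an assumption of this sketch: when an inner block
commits, its saved values are merged into the enclosing entry, so that the
outer block can still undo the inner block's updates.

```python
class RecoveryCache:
    """Stack of 'prior value' records, one per active recovery block."""

    def __init__(self, variables):
        self.vars = variables      # the global variables (a dict)
        self.stack = []            # one dict of saved values per block

    def enter_block(self):
        self.stack.append({})

    def assign(self, name, value):
        # Save the old value only on the first update within the block.
        if self.stack and name not in self.stack[-1]:
            self.stack[-1][name] = self.vars[name]
        self.vars[name] = value

    def restore(self):
        """Undo all updates of the current block; the block stays active."""
        for name, old in self.stack[-1].items():
            self.vars[name] = old
        self.stack[-1].clear()

    def commit(self):
        """Discard the top entry, merging it into the enclosing one if any."""
        inner = self.stack.pop()
        if self.stack:
            for name, old in inner.items():
                self.stack[-1].setdefault(name, old)


g = {"a": 1, "b": 2}
cache = RecoveryCache(g)
cache.enter_block()        # outer recovery block
cache.assign("a", 10)
cache.enter_block()        # nested recovery block
cache.assign("a", 20)
cache.assign("b", 30)
cache.restore()            # inner alternate failed
print(g)                   # {'a': 10, 'b': 2}
```

Only variables that are actually updated are saved, which is what makes the
cache cheaper in space than a full checkpoint.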

Thus, the problems of state restoration and recovery for simple global
variables are relatively straightforward. However, in general, the actions of
interacting processes may be more complicated than simple assignment to a
global variable; there may be, for instance, competition for global resources
(say, peripherals) or cooperative use of resources for inter-process
communication (say, a shared message buffer). As has been noted in [Shri78], for
arbitrary interaction of processes, the problem of the management of recovery
information and the control of processes may become extremely complex.
However, the authors show that it is possible to break these interactions down
into different classes (interference, cooperation, and competition) and to
develop mechanisms to treat the recovery problems posed by these different
types of interactions separately.

E.4.3  Error  Recovery  in  Cooperating  and  Competing  Processes 

The bulk of [Shri78] is devoted to the consideration of the problem of
competing processes. This problem is simpler than that of cooperating
processes for the following reason: while cooperating processes can exchange
arbitrary information (for instance, via a message buffer), competing
processes typically exchange only that information required to ensure proper
synchronization and sharing of resources. Thus, for competing processes, the
type of information exchanged is known to (and generally controlled by) the
synchronization mechanism.

However, since the information exchanged by cooperating processes may be
arbitrary, in general only the recipient of the information may verify it. In
[Shri78], the example of a producer and a consumer connected by a bounded
message buffer is considered. For verification reasons, "production" and
"consumption" of a message are programmed as a conversation, and when the
producer and consumer processes enter a conversation, they are allowed to
leave it only when both pass their acceptance tests. This prevents the
producer from "racing ahead" of the consumer and thus can seriously limit the
amount of concurrency possible. Whether the conversation mechanism may be
supplemented by some other mechanism to ameliorate this problem is currently
under investigation at Newcastle upon Tyne.

E.4.4  The  Domino  Effect 

Another problem in the application of the recovery block scheme to
cooperating processes is the so-called domino effect ([Rand78], [Ande81]).
This effect arises from attempts by the individual communicating processes to
achieve backward recovery. If two processes independently establish recovery
points or checkpoints, and communication between them may occur at arbitrary
times, then we may have the scenario represented in the following diagram
([Ande81]):

[Diagram: two horizontal time lines, one for Process 1 and one for Process 2.
Square brackets on each line mark recovery points; vertical lines joining the
two time lines mark communications between the processes.]

Here, the vertical lines represent occurrences of communication between the
two processes, and the square brackets indicate an active recovery point to
which the state of a process may be restored. If process 1 experiences a
failure after its most recent recovery point, it may try to restore its state
at that point. Since it has not communicated with process 2 since that point,
process 2 need take no recovery action. If, however, process 2 encounters a
failure after its last communication with process 1, process 2 must restore
its state to its most recent recovery point, which occurred before its last
communication. Thus process 1 must be restored to a point at or before
process 2's recovery point, since the state of process 1 was changed by the
information exchange which took place after that point. However, the most
recent recovery point to which process 1 can restore occurred before this
exchange, which will similarly cause another rollback in the state of process
2, etc. Thus, an uncontrolled propagation of rollbacks in process states may
occur, much like a line of toppling dominoes. This effect does not occur for
independent, competing processes, since no such information flow occurs
between them.
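
The propagation of rollbacks can be made concrete with a small Python sketch.
Everything in it is an illustrative assumption rather than part of the cited
work: the function rollback, the representation of recovery points and
communications as numeric times, the sample times themselves, and the
convention that every process has a recovery point at time 0 and that all
other times are positive.

```python
import bisect
import math


def rollback(recovery_points, comms, failed, fail_time):
    """Return the time each process is restored to (math.inf = not rolled back).

    recovery_points: {process: sorted list of times, starting with 0}
    comms: list of (process_a, process_b, time) two-way exchanges
    """
    def latest_point(p, t):
        pts = recovery_points[p]
        return pts[bisect.bisect_left(pts, t) - 1]   # last point before t

    restored = {p: math.inf for p in recovery_points}
    restored[failed] = latest_point(failed, fail_time)
    changed = True
    while changed:
        changed = False
        for a, b, t in comms:
            if t >= fail_time:
                continue                 # exchange had not yet happened
            for undone, other in ((a, b), (b, a)):
                # 'undone' has rolled back past the exchange at time t,
                # but 'other' still holds state derived from it.
                if restored[undone] < t < restored[other]:
                    restored[other] = latest_point(other, t)
                    changed = True
    return restored


points = {1: [0, 2, 6], 2: [0, 4]}
comms = [(1, 2, 1), (1, 2, 3), (1, 2, 5)]

print(rollback(points, comms, failed=1, fail_time=7))  # {1: 6, 2: inf}
print(rollback(points, comms, failed=2, fail_time=7))  # {1: 0, 2: 0}
```

With these sample times, a failure of process 1 affects only process 1, while
a failure of process 2 drives both processes all the way back to their
starting states.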

Thus, a basic problem in recovery is the search for a "consistent" set
of recovery points, that is, for a set of checkpoints for which the domino
effect does not occur. A consistent or usable set of recovery points is
called a recovery line ([Rand78]).

As Randell has shown ([Rand75]), in order to obtain a consistent set of
recovery points for a group of freely-interacting processes it is required
that the pattern of interactions among the processes be known in advance. As
this is a rather unrealistic requirement, we must consider two alternatives
([Verh78]): (1) prevent the interactions, as in implicit or explicit locking,
or (2) synchronize the processes with respect to recovery. We shall see that
it is the latter route which has been chosen by the group at Newcastle upon
Tyne.

Recent work on recovery lines reported in [Ande81] has led to the
following definition of a restorable action:

"An atomic action is said to form a restorable action if: (i) on
entry to the atomic action all processes establish a recovery
point, (ii) these recovery points are not discarded within the
atomic actions, and (iii) processes leave the atomic action
simultaneously." ([Ande81])

Within  a  restorable  action,  backward  recovery  may  be  accomplished  by  restoring 
the  recovery  points  established  for  each  process  upon  entry  to  the  action  if 
an  exception  is  raised  by  any  of  the  processes  involved  in  the  action.  This 
protocol  can  be  seen  to  be  equivalent  to  the  conversation  protocol  described 
above. 
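
The three conditions can be illustrated with a sequential Python sketch. It
simulates the processes one after another rather than running them
concurrently, and all names in it (restorable_action, the producer and
consumer states, the acceptance test) are hypothetical choices made for the
illustration.

```python
import copy


def restorable_action(states, steps, acceptance_tests):
    """Run every step; commit all processes together or restore all of them.

    states: {process: dict}; steps and acceptance_tests: {process: callable},
    each callable receiving the whole `states` mapping.
    """
    # (i) every process establishes a recovery point on entry
    recovery_points = {p: copy.deepcopy(s) for p, s in states.items()}
    ok = True
    for step in steps.values():
        try:
            step(states)
        except Exception:
            ok = False
            break
    ok = ok and all(test(states) for test in acceptance_tests.values())
    if not ok:
        # (ii) the recovery points were kept, so every process can be restored
        for p in states:
            states[p].clear()
            states[p].update(recovery_points[p])
    return ok   # (iii) all processes leave together, committed or restored


def produce(s):
    s["producer"]["sent"] += 1
    s["consumer"]["received"].append(-1)    # a faulty (negative) message


states = {"producer": {"sent": 0}, "consumer": {"received": []}}
committed = restorable_action(
    states,
    {"producer": produce},
    {"consumer": lambda s: all(m >= 0 for m in s["consumer"]["received"])},
)
print(committed, states)
```

Because the consumer's acceptance test rejects the message, the producer is
rolled back as well, even though its own step completed without error.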

Work on extending the recovery-block method to cooperating processes is
described in [Ande81]. In particular, strategies for avoiding the domino
effect are discussed. As has been mentioned above, requiring communicating
processes to enter into conversations from which all processes involved must
exit together (thus committing the results of the conversation) will avoid the
domino effect at the cost of lost concurrency, much as for the requirement of
two-phase locking for synchronizing processes. Indeed, the conversation
mechanism is seen to fulfill the requirements of a "restorable action" as
defined above. In work by Russell ([Russ80]), certain protocols for ordering
message sending and receiving have been developed for which it can be shown
that uncontrolled rollback cannot occur.

E.4.5  Recoverable  Monitors 

While investigating the simpler problem of competing processes, however,
Shrivastava and Banatre have developed language features to support recovery
which may have more general interest. They introduce the idea of a
recoverable monitor, in which access to resources is controlled by a feature
called a port, which is similar to the class and "inner" constructs of SIMULA
or Concurrent Pascal. Assuming a Concurrent Pascal-like language, the syntax
of a port construct may be summarized as follows ([Shri78], [Ande81]):

    [entry] type <name> = port (formal parameters)
        "entry is an optional feature"
    begin ...local variable declarations...
        ...procedures/forward entry procedures, e.g.: ...
        forward entry procedure <name> (formal parameters);
        begin ... end;
        ...other procedures/forward entry procedures...
        [backward entry procedure <name>;
            "this procedure is optional"
        begin ... end]
        s1; inner; s2
        "s1 and s2 are statements, where s1 is the prelude
         and s2 is the postlude"
    end "of port definition"

The  organization  of  the  port  construct  reflects  (and  enforces)  a  resource- 
access  protocol  considered  in  [Shri78].  There,  the  types  of  recovery  actions 
necessary  when  failure  occurs  at  various  points  in  the  protocol  are  developed. 
In  particular,  the  protocol  requires  that  only  the  prelude  and  postlude  of  the 
port may acquire and release resources, respectively. Also, the backward
entry feature allows specification of an "undo" block, whose purpose is to
undo the effects of the execution of the forward entry blocks, which is
necessary for state restoration of arbitrary global objects.

If failure occurs during the prelude of a port (s1), this means that all
alternatives of the resource-acquisition block have failed, and thus the port
fails. If failure occurs after acquisition and before use of a port (between
s1 and inner), then to restore the state of the (abstract) port, the only
action required is the release of the acquired resource; thus, the postlude
(s2) must be executed. The use of a resource (inner) is considered to be an
atomic call on a recovery block, and if failure is signalled, then only
execution of the postlude is required. If failure is detected after the use but
before the release of the resource, then the backward entry procedure must be
executed to undo the effects of the user procedure, and then the postlude must
be executed. A failure during the postlude (s2) means that it was not
possible to release the acquired resource, which is an unrecoverable error.
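
These cases can be summarized in a short Python sketch. It is not the port
construct of [Shri78]; the names run_port, release, and Unrecoverable are
hypothetical, each phase is passed in as an ordinary callable, and a failure
between the prelude and the use is not modelled separately because it is
handled in the same way as a failed use (the postlude alone is executed).

```python
class Unrecoverable(Exception):
    """The postlude failed: the acquired resource could not be released."""


def release(postlude):
    try:
        postlude()
    except Exception as exc:
        raise Unrecoverable("resource not released") from exc


def run_port(prelude, use, after_use, backward, postlude):
    prelude()              # a failure here simply makes the port fail
    used = False
    try:
        use()              # atomic: a failure here leaves nothing to undo
        used = True
        after_use()        # work done after the use, before the release
    except Exception:
        if used:
            backward()     # undo the effects of the use
        release(postlude)
        raise
    release(postlude)


def fail():
    raise RuntimeError("fault detected")


log = []
try:
    run_port(
        prelude=lambda: log.append("acquire"),
        use=lambda: log.append("use"),
        after_use=fail,
        backward=lambda: log.append("undo"),
        postlude=lambda: log.append("release"),
    )
except RuntimeError:
    pass
print(log)  # ['acquire', 'use', 'undo', 'release']
```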


E.4.6  Effects  on  Software  Complexity 

As has been noted above, forward-recovery methods must be designed as an
integral part of the software which they are to serve. Thus, they may add
significantly to the complexity of this software. In contrast, the
recovery-block scheme provides a means of explicitly separating the
error-detection and recovery functions from the rest of the software, and thus
should add little to the conceptual complexity of a module. In addition, it is
possible (and indeed desirable) that the design of any back-up blocks provided
proceed independently of the primary block and of each other. This independence
of the alternative blocks may produce a significant reduction in the complexity
of software employing the recovery-block method as compared to software using
ad hoc error detection and recovery methods ([Ande81]). Also, the requirement
of acceptance tests is more rigidly enforced than the use of assertions in
some systems, thus providing an enforced verification method.


E.4.7 Implementation for Distributed Systems

Several problems crop up in the implementation of backwards-recovery
schemes for loosely-coupled distributed systems under decentralized control
which are not apparent in implementation for non-distributed systems
([Ande81]). Cooperating processes in such systems must exchange control
information in addition to exchanging data in order to coordinate the recovery
process in the absence of a central coordinator. In an unsafe message-passing
system, there may be significant delay between the sending and reception of
these control messages, or they may become corrupted or lost. This adds
greatly to the complexity of the recovery problem.


If a distributed recovery system relies on planned recovery lines, there
is a need for coordination of the exits of processes from restorable actions
in order to ensure the existence of these recovery lines. This necessitates
the existence of a central coordinator, such as that in System R (see below),
which governs a two-phase commit protocol not unlike the conversation
mechanism discussed above.


A system may instead search for unplanned recovery lines. Such a system
is studied in the occurrence graph scheme ([Merl78]). An occurrence graph is
a historical record of the dependencies between communicating processes due to
the information flow between them. Such a record is kept by each process in
the system. Should a process need to restore a recovery point, it must send a
FAIL message to those dependent processes as given by the occurrence graph in
order to maintain the consistency of the system state. Each process which
receives a FAIL message must cease its normal activity and also send out FAIL
messages to all of its dependents. The assumption must be made that the FAIL
messages propagate faster than normal messages (and that none of them are
lost). In this way a recovery line may be eventually identified. This scheme
is known as the chase protocol. Unfortunately, recent investigations have
shown that this method is highly prone to the domino effect ([Ande81]).
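
The spread of FAIL messages amounts to a traversal of the dependency
relation, as the following Python sketch suggests. It is a simplification
under stated assumptions, not the protocol of [Merl78]: the function
propagate_fail is hypothetical, the per-process records are merged into one
dictionary, message delivery is taken to be reliable and immediate, and the
choice of recovery points is not modelled.

```python
from collections import deque


def propagate_fail(dependents, failed):
    """Return the set of processes that stop as FAIL messages spread.

    dependents: {process: processes that received information from it}
    """
    halted = {failed}
    queue = deque([failed])
    while queue:
        sender = queue.popleft()
        for receiver in dependents.get(sender, ()):
            if receiver not in halted:      # first FAIL message received
                halted.add(receiver)
                queue.append(receiver)      # it notifies its own dependents
    return halted


dependents = {"A": ["B"], "B": ["C"], "C": [], "D": ["A"]}
print(sorted(propagate_fail(dependents, "A")))  # ['A', 'B', 'C']
```

Process D is untouched because it only supplied information to A and received
none that depends on the failed computation.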


E.5 OTHER BACKWARDS-RECOVERY SCHEMES

Several recovery schemes which bear similarities to the recovery block
scheme have been discussed in the literature (see [Ande81]). System R, an
experimental data-base system ([Kohl81], [Gray81]), employs a "DO-UNDO-REDO"
system for treatment of hardware failures via maintenance of an incremental
log with write-ahead. A centralized "coordinator" controls a two-phase commit
protocol, and independence of actions is required to avoid the domino effect.
The REDO of an action is effective only for idempotent actions (i.e., those
for which multiple executions are valid), and the System R scheme is thus less
powerful than the alternative-block strategy of the recovery-block method.
Another similar method is the deadline mechanism for real-time systems, where
the acceptance test of the recovery-block scheme is replaced by a time-out
test. Yet another fault-tolerance method is the so-called N-version scheme,
in which the results of applying several different algorithms to the solution
of a problem are compared for agreement.
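
As an illustration of the last of these, the comparison step of an N-version
scheme might be sketched in Python as below. Majority voting is only one
possible agreement rule and is an assumption of this sketch, as are the
function name n_version and the three sample versions; results are assumed to
be hashable values.

```python
from collections import Counter


def n_version(versions, x):
    """Run every version on x and return the majority result."""
    results = [version(x) for version in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among versions")
    return value


versions = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x + x,      # a faulty version
]
print(n_version(versions, 3))  # 9
```

Unlike a recovery block, every version is executed on every call, and no
acceptance test or state restoration is involved.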

E.6 UNIFIED VIEW OF PROGRAMMED AND AUTOMATIC EXCEPTION HANDLING

In a recent paper from Newcastle upon Tyne ([Cris82]), Cristian
initiates the development of a formal view of the concepts underlying software
fault tolerance in order to elucidate the unity between programmed exception
handling and default exception handling using automatic backwards recovery.
Also, his formal development demonstrates the existence of a class of design
faults which cannot be treated using automatic methods such as the
recovery-block method.


Cristian  bases  his  model  on  a  view  of  programs  as  a  hierarchy  of 
modules,  assuming  that  this  structure  is  the  result  of  the  application  of  data 
abstraction techniques to program development. Thus, a user would view a
module  M  as  an  abstract  variable  of  some  abstract  data  type,  that  is,  a  set  of 
abstract  states  and  transitions  between  these  states  (produced  by  the 
operations  exported  by  M).  The  internal  structure  of  M  (not  visible  to  the 
user)  is  a  set  of  state  variables  and  procedures  which  operate  on  these 
variables. 

The internal state of the module M is defined as the aggregation of the
abstract states of the state variables of M. The abstract state of M is the
result of applying an abstraction function A to the internal state of the
module M. Note that this definition is recursive; the state variables of M
may themselves be the abstract states of lower level modules. Presumably,
however, the recursion bottoms out in the lowest level modules, where the
state variables are actual data structures.

The  abstract  state  of  a  module  is  in  general  a  partial  function  defined 
only  over  some  set  of  internal  states  which  satisfy  an  invariant  predicate  I. 
The  states  which  satisfy  this  predicate  are  said  to  be  consistent  with  the 
abstraction  which  is  supposed  to  be  implemented  by  the  module.  However,  a 
module  may  during  execution  pass  through  states  which  do  not  satisfy  this 
invariant  predicate,  and  thus  for  which  the  abstract  state  is  not  defined. 

The intended service of a procedure P exported by the module M is
specified by a relation post over pairs of initial and final states (s',s) of
the state transition accomplished by the procedure. A pair of states (s',s)
is said to be in post if the final state s is the intended outcome of invoking
the procedure P in the initial state s'. The characteristic predicate
associated with the relation post is called the standard postcondition of P.

The standard domain (SD) of a procedure P is defined as that set of
initial states s' for which execution of P terminates normally in states s
such that post (s',s) holds. If P is invoked in an initial state s' outside
its standard domain SD, an exception occurs. Such states s' belong to the
exceptional domain (ED) of P, that is, the set of states which do not belong
to the SD of P.

To illustrate these concepts, Cristian presents the following short
example. Let the intended service for some procedure P exported by a module M
be specified by post == i = i' + j', where i' and j' denote the initial values
of state variables i and j of M, which are of type positive integer. If the
body of the procedure P is

    i := i + j

and PI is the set of machine-representable positive integers, then the
standard domain SD of P is i' + j' in PI, and the exceptional domain ED of P
is i' + j' not in PI. Had the programmer by mistake typed "*" instead of "+"
in the body of P, then the SD and ED of P would be

    SD == (i' = j') and (i' = 0 or i' = 2)
    ED == ~SD

A  programmer,  in  the  design  of  a  procedure,  may  anticipate  that  the  procedure 
may  be  invoked  in  initial  states  outside  its  standard  domain,  i.e.  in  its 
exceptional domain. The programmer may detect such anticipated exception
occurrences by such means as run-time checks. However, in general such checks
may  be  redundant,  since  the  condition  which  the  check  is  supposed  to  detect 
may  be  detected  by  the  hardware  before  the  check  can  be  executed.  Also,  in 
general  it  does  not  make  sense  to  continue  normal  execution  of  a  program  after 
such  a  condition  becomes  apparent.  Thus  some  languages  contain  features 
allowing  the  programmer  to  express  actions  to  be  undertaken  in  place  of  normal 
execution  upon  an  exception  occurrence.  Such  features  are  termed  exception 
mechanisms. 

As  an  example  of  the  explicit  programming  of  handlers  for  exception 
occurrences,  Cristian  gives  the  following: 

    proc P signals OW;
    begin
        i := i + j [OV -> signal OW];
        i := i + k [OV -> i := i - j; signal OW];
    end;

Here, the first line of the example expresses the existence of two exit points
from the procedure P: the normal exit, and another exit on occurrence of the
exception OW. On the third line, if the addition causes an overflow exception
(OV), a handler which merely signals the exception OW to the invoking
procedure is executed. Note that, if the addition causes overflow, the
assignment is not executed, and thus the initial state remains unchanged.
Similarly, on line four an OV exception will cause execution of a handler
which undoes the effect of the preceding line (by subtracting the value which
was added there), and then signals OW. This has the effect of restoring the
initial state.

The standard postcondition for this procedure may be expressed by post
== i = i' + j' + k'. However, if the exception OW is signalled by P (i.e., if
it uses its exceptional exit), then its exceptional postcondition

    post (OW) == (i = i') and (j = j') and (k = k')

is  satisfied.  In  general,  if  E  is  an  exception  signalled  by  a  procedure  P, 
then  post  (E)  specifies  the  intended  state  transition  when  P  signals  E. 
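
The handler structure of this example can be imitated in Python. The sketch below is only an illustration and is not Cristian's notation: the bound MAX_INT, the exception classes OV and OW, and the helper checked_add are assumptions introduced here to simulate a machine with bounded integers.

```python
MAX_INT = 100  # hypothetical largest machine-representable integer


class OV(Exception):
    """Low-level overflow exception raised by the simulated hardware."""


class OW(Exception):
    """Exception signalled by P at its exceptional exit."""


def checked_add(a, b):
    """Return a + b, or raise OV without producing a result on overflow."""
    result = a + b
    if result > MAX_INT:
        raise OV()
    return result


def proc_p(state):
    """Perform i := i + j; i := i + k, signalling OW with the state restored."""
    try:
        state["i"] = checked_add(state["i"], state["j"])
    except OV:
        # The assignment did not take place, so nothing has to be undone.
        raise OW() from None
    try:
        state["i"] = checked_add(state["i"], state["k"])
    except OV:
        # Undo the first addition before signalling the exception.
        state["i"] = state["i"] - state["j"]
        raise OW() from None
```

On normal exit the state satisfies post; whenever OW is raised, i, j, and k hold their initial values, which is what post (OW) requires.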

The procedure P given above is said to be total, since its behavior is
specified (by means of its standard and exceptional postconditions) for all
initial states, both in its standard and exceptional domains. Also, its
exceptional postcondition is of the form A(s') = A(s), that is, the abstract
state upon exceptional exit is the same as upon invocation. A total operation
for which post (E) has such a form is called an atomic operation.

Here, Cristian notes that since exception detections may signal attempts
to violate invariants which are maintained by communicating processes, the
notions of atomicity with respect to exceptions (recovery atomicity) and
atomicity with respect to synchronization (concurrency atomicity) become
interrelated.

As has been noted above, when an exception occurrence is detected, it is
possible that an inconsistent state exists, that is, one for which the
invariant I of the module M is not satisfied. Since further use of an
inconsistent state can lead to unpredictable results, it is necessary to
recover some consistent state. The set of state variables of the module M for
which a consistent final state s may be reached by modifying the state these
variables have in the inconsistent state i, and for which the final state s
satisfies the relation

    I (s) and post (E) (s',s)

is called a recovery set (RS). Further, an inconsistency set (IS) is a
recovery set for which, for any other recovery set RS, |IS| <= |RS| (where the
vertical bars indicate set cardinality). Thus, IS is just the smallest of the
(in general) several possible recovery sets.

When atomicity with respect to exceptions is desirable, there are some
other recovery sets of interest. The inconsistency closure (IC) associated
with the inconsistent state i is defined as the set of all state variables
modified between entry into the procedure P and the detection of an exception
E during the execution of P. Note that IC is trivially a recovery set, since
the final state s is identical to the initial state s' upon restoration of the
initial states of all the modified variables, and

    I (s) and (A (s') = A (s))

is thus satisfied. A crude approximation to the IC is obtained by storing the
whole set of the initial states of the state variables of M, obtaining a
complete checkpoint.
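
The difference between an inconsistency closure and a complete checkpoint can be sketched as follows. This is a minimal illustration under assumptions made here, not a mechanism taken from [Cris82]: the class name Module and its methods begin, assign, and reset are hypothetical, and the state variables are kept in a dictionary.

```python
class Module:
    """A module whose state variables are kept in a dictionary."""

    def __init__(self, **variables):
        self.vars = dict(variables)
        self._saved = {}  # initial values of the variables modified so far

    def begin(self):
        """Mark entry into a procedure: nothing has been modified yet."""
        self._saved = {}

    def assign(self, name, value):
        """Assign to a state variable, saving its initial value on first write."""
        if name not in self._saved:
            self._saved[name] = self.vars[name]
        self.vars[name] = value

    def inconsistency_closure(self):
        """Names of the variables modified since begin()."""
        return set(self._saved)

    def reset(self):
        """Restore the initial values of exactly the modified variables."""
        self.vars.update(self._saved)
        self._saved = {}
```

Only the variables actually written are saved, so the recorded set is the IC; a complete checkpoint would instead copy every entry of vars in begin().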

If atomic behavior is not necessary for a procedure P, then forward
recovery may be used, as discussed above. Then the recovery actions are
(necessarily) based upon the designer's knowledge of the semantics of P. If,
however, it is desirable that P behave atomically with respect to exceptions,
then the use of IC sets or checkpoints to restore a consistent state is
necessary. As has been noted above, this method is called backward recovery.
As we have seen, it is possible that the IC or checkpoint may be determined
automatically, yielding automatic backward recovery, in contrast to explicitly
programmed backward recovery.

As defined above, a necessary condition for the atomicity of an operation
is that the operation be total. However, in practice the design of total
operations is difficult. Thus, in most cases the designer of an operation
anticipates only some subset of the exceptional occurrences possible in that
operation. The true standard and exceptional domains of the operation may
therefore be other than those which the designer imagines. The portion of the
exceptional domain for which the designer provides a specified exceptional
exit point is called the anticipated exceptional domain (AED). That portion
of the ED not included in the AED is called the unanticipated exceptional
domain (UED). The operation may terminate normally when invoked in its
standard domain, in a state satisfying post (E) when invoked in its
anticipated exceptional domain, and in an undetermined state when invoked in
its unanticipated exceptional domain.

To  illustrate  these  concepts,  Cristian  rewrites  the  example  given  above 
as  follows,  where  the  intended  and  exceptional  services  were  specified  by  the 
relations 


    post == i = i' + j'

    post (OW) == (i = i') and (j = j')

and the procedure body is

    proc P signals OW;
        i := i * j [OV -> signal OW];

Here, the programmer has mistakenly typed "*" instead of "+". Then the
domains for this example are

    SD  == (i' = j') and (i' = 0 or i' = 2)
    ED  == ~SD
    AED == ~SD and (i'*j' not in PI)
    UED == ~SD and (i'*j' in PI)
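
These domains can be checked by brute force over a small range of integers. The sketch below assumes that the mistyped operator is multiplication and that PI is the range 0..MAX_INT; the bound MAX_INT and the function classify are hypothetical and serve only this illustration.

```python
MAX_INT = 20  # hypothetical bound; PI is taken to be 0..MAX_INT
PI = range(MAX_INT + 1)


def classify(i0, j0):
    """Classify the initial state (i0, j0) for the faulty body i := i * j."""
    product = i0 * j0
    if product not in PI:
        # Overflow is detected and OW is signalled with the state unchanged,
        # so post (OW) holds: the state is an anticipated exceptional one.
        return "AED"
    if product == i0 + j0:
        # Normal termination that happens to satisfy post == (i = i' + j').
        return "SD"
    # Normal termination in a state that violates post: unanticipated.
    return "UED"


domains = {"SD": set(), "AED": set(), "UED": set()}
for i0 in PI:
    for j0 in PI:
        domains[classify(i0, j0)].add((i0, j0))
```

The enumeration finds only (0, 0) and (2, 2) in the standard domain, in agreement with SD == (i' = j') and (i' = 0 or i' = 2).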

There  are  several  possible  outcomes  of  the  invocation  of  an  operation  in 
its  unanticipated  exceptional  domain:  it  may  never  terminate  (go  into  an 
infinite loop); a lower level procedure may detect (and propagate) an
exception not anticipated by the designer of the operation, and for which a
handler
does  not  exist;  the  operation  may  terminate  at  its  standard  exit  point  in  a 
state  not  satisfying  its  standard  specification;  or  it  may  terminate  at  its 
exceptional  exit  point  in  a  state  not  satisfying  its  exceptional 
specification. 

The  problem  of  handling  unanticipated  lower-level  exceptions  is  treated 
in  Ada  by  continuing  the  propagation  of  the  lower-level  exception  to  higher 
levels  if  no  handler  is  present.  Cristian  claims  that  this  solution  is 
dangerous  for  several  reasons.  According  to  the  principle  of  information 
hiding, the upper level procedure may know nothing of the lower level
exception, and thus have no handler for it. Also, continued propagation
violates
the  principle  that  the  flow  of  control  should  return  from  the  invoked 
procedure  to  the  invoker.  In  effect,  the  flow  of  control  is  through  an 
undeclared  exit  point  from  the  procedure  propagating  the  exception. 

A simpler solution, Cristian states, is the provision of default
handlers for these unnamed exceptions by the compiler. This implicitly
provided handler may be used as follows:

    proc P signals E;
    begin
        . . .
    end [ -> DH];

Here, DH denotes the default handler, and the blank before the arrow denotes
any exception for which there is no handler explicitly provided.

The purposes of such default exception handlers may be the masking of
exceptions, that is, making it appear to higher level procedures that no
exception has occurred at all; the recovery of a consistent state; or the
signalling of an exceptional occurrence to a higher level procedure. The CLU
language is oriented towards the last of these goals, in effect providing a
default handler of the form

    DH == signal FAILURE.

The language SESAME under development at the University of Grenoble is
oriented towards both recovery and signalling, providing

    DH == reset; signal FAILURE.

Here, the reset primitive restores the initial state of the operation.
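
The two forms of default handler can be contrasted in a short sketch. It is written in Python rather than in CLU or SESAME and makes no claim about how either language implements the idea; the decorator with_default_handler, the exception class Failure, and the example operations are all hypothetical.

```python
import copy
import functools


class Failure(Exception):
    """The FAILURE exception signalled by a default handler."""


def with_default_handler(reset):
    """Attach a default handler to an operation on a mutable state dict.

    reset=False imitates DH == signal FAILURE;
    reset=True imitates DH == reset; signal FAILURE.
    """
    def decorate(operation):
        @functools.wraps(operation)
        def wrapper(state, *args):
            checkpoint = copy.deepcopy(state) if reset else None
            try:
                return operation(state, *args)
            except Exception:
                # Any exception without an explicit handler ends up here.
                if reset:
                    state.clear()
                    state.update(checkpoint)
                raise Failure() from None
        return wrapper
    return decorate


@with_default_handler(reset=False)
def scale_signal_only(state, divisor):
    state["calls"] += 1
    state["value"] = state["value"] // divisor


@with_default_handler(reset=True)
def scale_with_reset(state, divisor):
    state["calls"] += 1
    state["value"] = state["value"] // divisor
```

With a zero divisor both operations signal Failure, but only the second leaves the state exactly as it was on entry; the first leaves the counter incremented.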

The  recovery  block  mechanism,  on  the  other  hand,  is  oriented  towards 
fulfilling  all  three  goals.  A  recovery  block,  such  as 

    RB == ensure post by P0
          else by P1 else FAILURE;

may be expressed in terms of default exception handlers as follows:

    RB == P0' [ -> reset;
          P1' [ -> reset; signal FAILURE]]

where

    Pi' == begin Pi; [~post -> signal FAILURE] end;
    for i = 0, 1.

Thus default handlers are at least equivalent in power to recovery blocks.
This suggests that recovery blocks are implementable under any system which
provides  default  exception  handlers  and  a  reset  primitive.  Although  less 
powerful  than  default  exception  handling,  the  recovery  block  scheme  is 
preferable  (at  least  at  the  application  level)  since  it  provides  a  useful 
abstraction  of  a  rather  messy  technique. 
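
A recovery block built from a checkpoint, a reset, and an acceptance test might look as follows. This is a sketch under assumptions made here: the function recovery_block, the exception class Failure, and the sorting alternates are hypothetical, and the state is a dictionary that is copied on entry.

```python
import copy


class Failure(Exception):
    """Signalled when every alternate has been rejected."""


def recovery_block(state, acceptance_test, alternates):
    """ensure acceptance_test by alternates[0] else by alternates[1] ... else FAILURE."""
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(checkpoint, state):
                return  # the results of this alternate are accepted
        except Exception:
            pass  # an exception is treated like a failed acceptance test
        # reset: restore the state saved on entry before the next alternate
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise Failure()


def faulty_sort(state):
    state["data"] = state["data"][:-1]  # design fault: loses an element


def backup_sort(state):
    state["data"] = sorted(state["data"])


def post(initial, final):
    """Acceptance test over the pair (initial state, final state)."""
    return final["data"] == sorted(initial["data"])
```

Calling recovery_block(state, post, [faulty_sort, backup_sort]) rejects the primary, restores the checkpoint, and accepts the result of the backup; if every alternate is rejected, the state is restored and Failure is raised.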

An operation is said to be weakly tolerant to an exception D if D is
detected and the (programmed or default) handler of D recovers a consistent
state before propagating D to the invoking procedure. An operation is
strongly tolerant to D if it can mask the occurrence of D to higher-level
procedures. As may be seen from the discussion of automatic error
recovery above, these methods may be used to render the transactions of a
system strongly or weakly tolerant to detected unanticipated exception
occurrences.

A  procedure  is  said  to  contain  a  design  (algorithmic)  fault  if  its  UED 
is  non-empty.  A  system  strongly  or  weakly  tolerant  to  failure  occurrences 
caused  by  design  faults  is  called  design  fault-tolerant. 

The  commitment  interval  of  a  transaction  is  defined  as  the  time  interval 
between  the  beginning  and  the  end  of  transaction  execution.  If  there  is  a 
design  fault  in  the  code  implementing  the  operation,  however,  the  acceptance 
test  may  not  detect  the  consequences  of  the  fault  (since,  by  the  definition  of 
design  fault,  the  acceptance  test  was  not  designed  such  that  the  part  of  the 
exceptional  domain  in  which  its  effects  fall  was  checked  by  the  test).  Thus 
the  acceptance  test  will  be  passed,  the  results  of  the  transaction  committed, 
and  recovery  made  impossible  should  the  consequences  of  the  design  fault 
manifest  themselves  later.  The  time  between  the  manifestation  of  a  design 
fault and the detection of its consequences is called the latency interval.

Automatic  (or  programmed)  backward  error  recovery  methods  are  adequate 
if the latency intervals of all transactions are contained within the
respective commitment intervals of the transactions. However, these methods
cannot
cope  with  situations  where  the  latency  intervals  of  transactions  may  stretch 
over  several  successive  transaction  executions. 

The  prevention  of  such  situations  is  tied  in  with  the  problem  of  the 
adequate  specification  of  acceptance  tests  such  that  the  UED  of  an  operation 
is empty, that is, so that there are no design faults (undetected exceptional
occurrences) in a system. This problem is a current focus of research at
Newcastle upon Tyne.

E.7  RECENT RESEARCH

Work  by  the  group  at  the  University  of  Newcastle  upon  Tyne  continues  in 
the area of software fault tolerance ([Rand81]). Recent work there includes
investigations  into  the  design  of  reliable  remote  procedure  call  mechanisms 
([Shriv82]). 

Problems in the implementation of recovery blocks include the selection
of checkpoint intervals and of appropriate points at which previously
checkpointed information may be discarded ([Russ80]). Since the discarding of
checkpoint information is equivalent to commitment to the results of the
checkpointed block, this issue is of no small importance. Another problem is
the design of acceptance tests for the recovery blocks, which is discussed in
detail in [Ande81]. It is this problem which, as has been noted above in the
discussion of [Cris82], leads to inadequacies of automatic backwards-recovery
systems when design faults may manifest their presence after the commitment of
the results of the affected block. The proper design of acceptance tests is a
current thrust of research at Newcastle upon Tyne.

For distributed systems, the problem of coordination of the separate
processes in a recoverable action may be solved by the two-phase commit
protocol of Gray ([Ande81]). Here, a separate "coordinator" process ensures
that, if any process requests backward recovery, all processes are instructed
to restore to their recovery points. This is an extension of the
"conversation" mechanism described above. Work has started at Newcastle upon
Tyne on a search for communication protocols for recovery which can identify
recovery lines without the necessity for a central coordinator, or the
exchange of large amounts of control information on a (possibly unsafe)
message-passing system ([Ande81]).
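
The role of the coordinator can be illustrated with a deliberately simplified, single-machine sketch. It ignores message loss and coordinator failure, which are the hard parts of a real two-phase commit; the class Process and the function coordinate are hypothetical names introduced for this illustration only.

```python
class Process:
    """A participant holding one value and one recovery point."""

    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.recovery_point = None

    def establish_recovery_point(self):
        self.recovery_point = self.value

    def restore(self):
        self.value = self.recovery_point

    def commit(self):
        self.recovery_point = None  # discarding the checkpoint commits


def coordinate(processes, work, acceptable):
    """Run work on every process, then commit all or restore all."""
    for p in processes:
        p.establish_recovery_point()
    # Phase 1: each process does its part and reports whether it accepts.
    votes = []
    for p in processes:
        work(p)
        votes.append(acceptable(p))
    # Phase 2: the coordinator applies one decision to every process.
    if all(votes):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.restore()
    return False
```

A single rejecting process is enough to send every participant back to its recovery point, which is the behavior described for the coordinator above.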

A  possible  strategy  which  should  be  considered  in  adding  algorithmic- 
failure  recovery  mechanisms  to  PRONET  is  the  notion  of  "overlaying"  a  back-up 
process  on  the  address  space  of  its  failed  predecessor.  This  scheme  would 
have  the  additional  advantage  of  allowing  transparent  replacement  of  existing 
permanent  network  processes.  Old  software  could  be  replaced  at  an  appropriate 
time  (say,  at  a  checkpoint)  by  overlaying  a  new  version  on  the  address  space 
of  the  old  software,  without  having  to  halt  the  entire  program.  A  similar 
scheme is discussed in [Ande81], where it is suggested that older versions be
retained  as  the  back-up  algorithms. 

Allchin  and  McKendry  ([Allc82])  have  proposed  that  recent  work  in  the 
database  field  on  "semantic  correctness"  (as  opposed  to  strict  enforcement  of 
correctness  criteria,  such  as  serializability)  may  be  extended  to  the 
decentralized  global  operating  system  for  a  local  area  network  which  is 
currently  under  development  at  Georgia  Tech.  In  their  model,  support  for  data 
management is constructed using abstract data types (instances of which are
"objects") together with nested actions. They argue that serializability is
often too strong a correctness criterion for the abstract behavior of an
object, and that it is sometimes necessary or desirable, especially for
efficiency considerations, that the implementation of an object violate
strict serializability. Synchronization and recovery for objects are thus
user-defined,  since  the  writer  of  an  object  has  semantic  knowledge  of  the 
object  which  would  be  extremely  difficult,  if  not  impossible,  for  the  system 
to  determine. 

Similar  considerations  may  be  applied  to  the  design  of  algorithmic  fault 
tolerance  features  in  PRONET.  In  particular,  the  use  of  knowledge  of  the 
writer  of  a  recovery  block  about  the  objects  on  which  the  block  is  based  may 
lead  to  increases  in  efficiency  in  the  use  of  the  recovery  cache.  Another 
possible  line  of  investigation  would  be  the  application  of  Allchin’s  object- 
based  recovery  model  to  backwards  recovery.  Investigations  into  automatic 
backwards-recovery  schemes  thus  far  have  been  concerned  with  action-based 
recovery,  that  is,  the  recovery  information  has  been  associated  with  the 
operations  rather  than  with  the  objects.  Only  very  recently  has  work  appeared 
which  is  concerned  (even  peripherally)  with  recovery  in  object-oriented 
languages  or  systems  ([Cox83]). 

Considerable  further  study  of  the  reliability  issue  is  required. 
Programming  techniques  must  be  developed  to  effectively  utilize  the  failure 
handling  features.  These  techniques  may  influence  future  refinements  of  the 
process description language, since they are likely to be rather complex.


A SURVEY OF QUEUEING NETWORK MODELS OF COMPUTING SYSTEMS


John  A.  Miller 

F.1  INTRODUCTION 

In the design and analysis of computing systems, because of their
ever increasing complexity, it has become necessary to construct models of
these systems. The use of mathematical or other suitably precise models
enables one to abstract the essential features of systems for detailed study
of their behavior, interactions, and effects on total system functionality and
performance. This process of abstraction and quantification has the advantage
of enhancing the understanding of systems. For example, in attempting to
understand a particular operating system, one might find the high level
approach of a model more palatable than trying to ascertain the behavior of
the system from the knowledge of which bits get set when. An even more
important advantage of modeling is that it facilitates the use of optimization
in designing or improving systems.

For a model to be of use in studying a complex computing system, it must
come to grips with the following complications: the demands placed upon the
system are of a probabilistic nature, various activities are occurring at
various places in the system, and finally these activities may be
interdependent and occur concurrently. Queueing network models, first
introduced by R. R. P. Jackson [Jack54], are a useful tool in dealing with
these complexities. Basically, these models represent the system as a network
of nodes and arcs. Each node represents a device in the system and is composed
of a set of servers that are fed by a queue. Each arc represents a possible
flow path for jobs or work requests. From a suitable specification of the
model (a set of equations), the model can be solved to determine the
performance characteristics of the system, e.g., throughput, response times,
device utilizations, and queue lengths.

The purpose of this paper is to survey queueing network models for
computing systems. The paper is divided into two parts. In the first part we
will consider the mathematics of queueing networks. Specifically, we will
consider the elementary theory, various solution techniques, and software
packages used to solve queueing networks. In the second part of this paper we
will consider some specific modeling studies. This will be done from an
evolutionary point of view (from simple uniprocessor systems to complex
distributed processing systems). A comment on notation is in order at this
point: the symbols E and TT (that is, Σ and Π) will be used to denote the
summation and product operators respectively.

F.2  QUEUEING NETWORKS
F.2.1  Basic  Theory 

Before  considering  some  of  the  more  complex  techniques  used  to 
solve  queueing  network  models,  let  us  first  examine  some  of  the  elementary 
theory. We first consider open Jackson networks [Heym82]. The solution to
such  networks  is  particularly  simple  since  the  distributions  are  Markovian 
(probabilities  are  dependent  only  on  the  current  state  of  the  system,  not  on 
its  history  or  elapsed  time).  Specifically,  the  model  makes  the  following 
assumptions. 

1) [Structure] The network consists of N interconnected service centers.

2) [Arrivals] Exogenous (from outside) customers arrive at service center i
according to a Poisson process (exponential interarrival) with rate $y_i$.

3) [Routing] After receiving service at center i, a customer leaves the
network with probability $r_{i0}$ or goes instantaneously to service center j
with probability $r_{ij}$ (where node 0 can be thought of as a special
source/sink node). The routing probabilities $r_{ij}$ form a Markov chain with
transition (routing) matrix $R = (r_{ij})$.

4) [Service Center] Service center i consists of an infinite queue that feeds
$c_i$ identical servers. The service discipline is first-come-first-served
(FCFS), and the service times are independent identically distributed
exponential random variables with mean $1/u_i$.

We will want to obtain solutions to these types of queueing network
models that will specify the probability of the system being in a certain
state. Here the state of the system will be a vector that specifies the
number of customers at each service center, $s = (s_1, s_2, \ldots, s_N)$.
From this information one can then calculate other characteristics of the
system, e.g., waiting times, response times, and throughput.

First we need an expression for the total asymptotic arrivals at each
queue. This is given by the traffic equations, where arrivals at queue i are
given by

$$a_i = y_i + \sum_{j=1}^{N} a_j r_{ji} \quad \text{for } i = 1, \ldots, N.$$

This is a set of N equations in N unknowns, $a = (a_1, \ldots, a_N)$, which can
be shown to have the following solution

$$a = y (I - R)^{-1}.$$
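
The traffic equations can be solved numerically without forming the inverse, by iterating a <- y + aR until the values stop changing; this corresponds to summing the series y(I + R + R^2 + ...), which converges when every customer eventually leaves the network. The function name solve_traffic and the two-center example are assumptions made for this sketch.

```python
def solve_traffic(y, R, tol=1e-12, max_iter=10000):
    """Solve a = y + a R for the row vector a by fixed-point iteration."""
    n = len(y)
    a = list(y)
    for _ in range(max_iter):
        # a_i = y_i + sum over j of a_j * r_ji
        new = [y[i] + sum(a[j] * R[j][i] for j in range(n)) for i in range(n)]
        if max(abs(new[i] - a[i]) for i in range(n)) < tol:
            return new
        a = new
    raise RuntimeError("traffic equations did not converge")


# Hypothetical example: arrivals enter at center 1 only; half of the
# customers leaving each center go to the other one, the rest depart.
y = [1.0, 0.0]
R = [[0.0, 0.5],
     [0.5, 0.0]]
a = solve_traffic(y, R)
```

For this example the equations a_1 = 1 + 0.5 a_2 and a_2 = 0.5 a_1 give a_1 = 4/3 and a_2 = 2/3, which the iteration reproduces to within the tolerance.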

With this result in hand we are ready to find the probability that the
system is in state s, P(s,t). Specifically, we are interested in the steady
state solution where the system is in statistical equilibrium,
$p(s) = \lim_{t \to \infty} P(s,t)$. Note that transient state solutions are
also useful, but are generally harder to obtain.

To obtain a steady state solution, we apply the principle of conservation
of flow to get a set of flow balance equations. These equations can be
complex in general, but are not difficult for a single queue such as an M/M/c
queue. An M/M/c queue has an effective service rate of

$$u_s = \begin{cases} s u & \text{if } s < c \\ c u & \text{if } s \ge c \end{cases}$$

where s is the number of customers in the system. The flow balance equations
specify that the rate at which customers leave state s, $(y + u_s) p(s)$, must
equal the rate at which customers enter state s,
$y p(s-1) + u_{s+1} p(s+1)$. Hence the flow balance equations are

$$y p(0) = u p(1)$$

$$(y + u_s) p(s) = y p(s-1) + u_{s+1} p(s+1), \quad s > 0$$

which can be solved recursively to obtain after normalization

$$p(s) = \begin{cases} p(0) (y/u)^s / s! & \text{if } s < c \\ p(0) (y/u)^s / (c! \, c^{s-c}) & \text{if } s \ge c \end{cases}$$

where

$$p(0) = \left[ \sum_{s=0}^{c-1} \frac{(y/u)^s}{s!} + \frac{(y/u)^c}{c! \, (1 - y/(cu))} \right]^{-1}.$$
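
The M/M/c solution is easy to evaluate directly. The sketch below assumes the standard effective service rate min(s, c) u and a stable queue (y < cu); the function names mmc_p0 and mmc_p are hypothetical.

```python
from math import factorial


def mmc_p0(y, u, c):
    """Normalization term p(0) for an M/M/c queue; requires y < c*u."""
    if y >= c * u:
        raise ValueError("queue is unstable: need y < c*u")
    rho = y / u
    head = sum(rho**s / factorial(s) for s in range(c))
    tail = rho**c / (factorial(c) * (1 - y / (c * u)))
    return 1.0 / (head + tail)


def mmc_p(s, y, u, c):
    """Steady-state probability of s customers in an M/M/c queue."""
    rho = y / u
    p0 = mmc_p0(y, u, c)
    if s < c:
        return p0 * rho**s / factorial(s)
    return p0 * rho**s / (factorial(c) * c**(s - c))
```

For a single server (c = 1) with y = 1 and u = 2 the formulas reduce to the familiar p(s) = (1 - y/u)(y/u)^s, so p(0) = 0.5 and p(1) = 0.25.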

Getting back to the original problem, thanks to the J. R. Jackson
theorem [Jack57], we can decompose the network into N M/M/$c_i$ queues. Thus
the solution is formed from the product of independent component
probabilities,

$$p(s) = \prod_{i=1}^{N} p_i(s_i)$$

where

$$p_i(s_i) = \begin{cases} p_i(0) (a_i/u_i)^{s_i} / s_i! & \text{if } s_i < c_i \\ p_i(0) (a_i/u_i)^{s_i} / (c_i! \, c_i^{s_i - c_i}) & \text{if } s_i \ge c_i \end{cases}$$

and $p_i(0)$ is the analogous normalization sum. Notice that the key to the
tractability of this solution is the fact that a product form solution could
be found.
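
For the special case in which every center has a single server (c_i = 1), each factor reduces to (1 - a_i/u_i)(a_i/u_i)^{s_i}, and the joint probability is a one-line product. The sketch below covers only that special case; the function name jackson_joint and the numerical values are assumptions chosen for illustration.

```python
def jackson_joint(state, a, u):
    """p(s) for an open Jackson network of single-server centers (c_i = 1)."""
    prob = 1.0
    for s_i, a_i, u_i in zip(state, a, u):
        rho = a_i / u_i
        if rho >= 1:
            raise ValueError("center is saturated: need a_i < u_i")
        # Marginal of an M/M/1 queue with arrival rate a_i and service rate u_i.
        prob *= (1 - rho) * rho**s_i
    return prob


# Hypothetical two-center network with total arrival rates a and service rates u.
a = [4.0 / 3.0, 2.0 / 3.0]
u = [2.0, 2.0]
p_empty = jackson_joint((0, 0), a, u)
```

Here the utilizations are 2/3 and 1/3, so the probability that both centers are empty is (1/3)(2/3) = 2/9.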

Closed Jackson networks are also a useful type of model. The assumptions
for closed networks are the same as those for open networks, except that
there are a fixed number of customers (jobs) that circulate through the
network (i.e., there are no exogenous arrivals or departures). Using the
Jackson-Gordon-Newell theorem [Jack63, Gord67], we can obtain a product form
solution similar to the one obtained for open networks,

$$p(s) = C \prod_{i=1}^{N} p_i(s_i)$$

where

$$p_i(s_i) = \begin{cases} (a_i/u_i)^{s_i} / s_i! & \text{if } s_i < c_i \\ (a_i/u_i)^{s_i} / (c_i! \, c_i^{s_i - c_i}) & \text{if } s_i \ge c_i \end{cases}$$

and C is the normalization constant.
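
For small closed networks the normalization constant can be obtained by enumerating every feasible state. The sketch below does this by brute force; it assumes that the a_i are relative arrival rates (any solution of a = aR, fixed up to a constant factor) supplied by the caller, and the function names are hypothetical.

```python
from math import factorial


def compositions(total, parts):
    """Yield all tuples of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def center_term(s, a, u, c):
    """Unnormalized factor p_i(s_i) for a center with c servers."""
    rho = a / u
    if s < c:
        return rho**s / factorial(s)
    return rho**s / (factorial(c) * c**(s - c))


def closed_network(a, u, c, customers):
    """Return {state: probability} for a closed network with `customers` jobs."""
    weights = {}
    for state in compositions(customers, len(a)):
        w = 1.0
        for s_i, a_i, u_i, c_i in zip(state, a, u, c):
            w *= center_term(s_i, a_i, u_i, c_i)
        weights[state] = w
    total = sum(weights.values())  # this sum equals 1 / C
    return {state: w / total for state, w in weights.items()}


# Hypothetical example: two single-server centers sharing two jobs.
probs = closed_network(a=[1.0, 1.0], u=[1.0, 2.0], c=[1, 1], customers=2)
```

The number of states grows quickly with the number of centers and customers, so enumeration of this kind is practical only for small models.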


These  two  types  of  queueing  network  models  form  the  basis  for  the 
elementary  theory  of  queueing  networks.  When  their  assumptions  reasonably  fit 
the  real-world  problem  being  analyzed,  they  provide  easily  obtained  exact 
solutions. However, the real world is usually not so cooperative, so that
solution  techniques  to  more  general  models  will  be  needed. 

F.2.2  Solution Techniques

When  faced  with  complex  problems,  it  is  advantageous  to  have  a  large 
arsenal  with  which  to  attack  the  problems.  Below  is  an  overview  of  some  of 
the  more  useful  techniques  used  to  solve  queueing  networks.  They  are 
presented in the rough order in which one should try to use them, i.e., if a
problem yields to exact analysis, use it; otherwise try the next approach,
and so on.

F.2.2.1  Exact Analysis

Here we consider a general solution technique that yields exact closed-form
solutions. Models that have such solutions are called BCMP networks
[Bask75], and are generalizations of Jackson networks. For a model to be a
BCMP network it must satisfy the following set of assumptions.


1) [Structure] The network consists of N service centers and K classes of
customers.

2) [Arrivals] The types of arrivals determine the type of network. An open
network has exogenous arrivals, a closed network does not, and a mixed network
is open for some classes and closed for others. There are two basic types of
exogenous arrival processes. In the first, arrivals to the network are Poisson
with a rate that depends on the total number of customers, $y(M(s))$, where $M(s)$ is
the number of customers in the network. In this case, the exogenous arrival
rate at which class k customers arrive at center i is $y_{ik} = y\,q_{ik}$, where the
$q_{ik}$ are fixed probabilities. In the second, arrivals to subchain h
(see below) are Poisson with rate $y_h(M(s|E_h))$, in which case $y_{ik} = y_h\,q_{ik}$ for
each subchain.

3) [Routing] A customer of class k who completes service at center i will next
require service at center j in class l with probability $r_{ik,jl}$. The routing
probabilities $r_{ik,jl}$ form a Markov chain with transition (routing) matrix
$R = (r_{ik,jl})$. The Markov chain is assumed to be decomposable into m subchains,
where $E_1, \ldots, E_m$ denote the sets of states of these subchains (a state in this
context refers to the pair (center i, class k) describing a customer).

4) [Service Center] There are four types of service centers allowed in BCMP
networks. A type 1 service center consists of an infinite queue feeding $c_i$
identical servers. The service discipline is first-come-first-served (FCFS),
and all customers have the same exponential service time distribution. The
service rate can be state dependent, $u(M(s_i))$, where $M(s_i)$ is the number of
customers at the service center. A type 2 service center consists of an
infinite queue feeding a single server. The service discipline is processor
sharing (PS), and each class of customers may have a distinct service time
distribution. Note that PS is the limiting case of round robin (RR) as the time
quantum approaches zero. A type 3 service center has no queue and $c_i$ servers,
so that at any time the center can hold at most $c_i$ customers. Each class of
customers may have a distinct service time distribution. A type 4 service
center consists of an infinite queue feeding a single server. The service
discipline is preemptive-resume last-come-first-served (LCFS), and each class of
customers may have a distinct service time distribution. Note that in LCFS an
arriving job preempts the server and gets service until it either completes (at which
point the preempted job resumes) or is itself preempted [Klei76].

In types 2, 3, and 4 the service time distributions are arbitrary, but
must have rational Laplace transforms. Under this slight restriction, one is
able to represent the service time distributions as a sequence of
exponentially distributed stages using the method of stages [Bask75].

To solve BCMP networks, one can follow a procedure similar to the one
given for Jackson networks. Here the state of the system is given by
$s = (s_1, \ldots, s_N)$, where each $s_i$ is now a vector that completely specifies the status
of service center i: $s_i = (s_{i1}, \ldots, s_{in_i})$, where $n_i$ is the number of customers at
center i and $s_{ij}$ is the class of the jth customer in line. The traffic
equations for each subchain are

$$e_{ik} = q_{ik} + \sum_{(j,l) \in E_h} e_{jl}\, r_{jl,ik} \quad \text{for } (i,k) \in E_h$$

or, multiplying through by y to obtain a more familiar form,

$$a_{ik} = y_{ik} + \sum_{(j,l) \in E_h} a_{jl}\, r_{jl,ik} \quad \text{for } (i,k) \in E_h.$$

These  equations  are  a  direct  generalization  of  the  ones  given  for  Jackson 
networks  and  can  be  solved  similarly. 
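
The traffic equations form a linear system, so for an open network they can be solved by simple repeated substitution. The sketch below is a minimal illustration under simplifying assumptions: there is a single customer class, so each state (i,k) reduces to a center index i, and the network is open, so every customer eventually leaves and the iteration converges. The function name and the example routing matrix are hypothetical. A multiclass network could be handled the same way by numbering the (center, class) pairs.

```python
def solve_traffic(y, r, tol=1e-12, max_iter=10_000):
    """Solve a_i = y_i + sum_j a_j * r[j][i] by repeated substitution.

    y[i]    exogenous arrival rate at center i
    r[j][i] probability that a customer leaving center j goes next to center i
    """
    n = len(y)
    a = list(y)
    for _ in range(max_iter):
        new = [y[i] + sum(a[j] * r[j][i] for j in range(n)) for i in range(n)]
        if max(abs(new[i] - a[i]) for i in range(n)) < tol:
            return new
        a = new
    raise RuntimeError("traffic equations did not converge")


# Hypothetical example: all arrivals enter at center 0 and move to center 1;
# on leaving center 1, half return to center 0 and half leave the network.
rates = solve_traffic(y=[1.0, 0.0], r=[[0.0, 1.0], [0.5, 0.0]])
print(rates)
```

In this example each customer visits each center twice on average, so both total arrival rates come out at twice the exogenous rate.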

We  are  now  in  a  position  to  find  the  steady  state  solution,  using  what 
are  called  the  (local)  independent  balance  equations,  which  equate  the  rate  of 
flow  into  a  state  by  a  customer  entering  a  stage  of  service  to  the  flow  out  of 
that  state  due  to  a  customer  leaving  that  stage  of  service.  Note,  if  a 
customer  is  queued,  his  stage  will  correspond  to  the  stage  of  service  he  will 
be  in  when  he  next  gets  service.  Since  the  global  balance  equations  are  the 
sum  of  the  independent  balance  equations,  independent  balance  is  a  sufficient 
condition  for  global  balance.  The  solutions  to  BCMP  networks  are  specified  by 
the  Baskett-Chandy-Muntz-Palacios  theorem.  The  steady  state  probabilities  for 
the  case  of  type  1  arrivals  and  type  1  service  centers  are  given  by 

$$P(s) = C\, d(s) \prod_{i=1}^{N} p_i(s_i)$$

where

$$p_i(s_i) = (1/u_i)^{n_i} \prod_{j=1}^{n_i} e_{i s_{ij}}, \qquad d(s) = \prod_{m=0}^{M(s)-1} y(m),$$

and C is the normalizing constant. The rest of the cases are similar but
somewhat messy products (see [Bask75] for details).

This  solution  is  similar  to  the  solutions  obtained  for  Jackson  networks. 
It  is  again  a  product  form  solution,  implying  that  specific  solutions  can 
easily  be  computed.  In  fact,  BCMP  networks  define  a  very  general  class  of 
queueing  network  models  that  yield  exact  closed-form  solutions.  These  models 
are flexible enough to be useful in modeling real computing systems. For
example, type 1 service centers (FCFS) are good models for secondary storage
I/O  devices.  Type  2  and  4  service  centers  (PS  and  LCFS)  are  good  models  for 
processors  since  LCFS  is  an  efficient  preemptive  scheduling  algorithm  and  PS 
is  limiting  round  robin  (RR).  And  type  3  service  centers  (no  queueing)  are 
good models for terminals and routing delays in computer networks. If,
however, we violate one of the basic assumptions, we may not be able to find an
exact closed-form solution.

F.2.2.2 Operational Analysis

If  the  purpose  of  the  analysis  is  to  study  an  existing  system  for  say 
tuning  or  upgrading,  and  statistics  can  be  gathered  by  monitoring  the  system, 
then  operational  analysis  is  a  useful  and  easy  to  understand  tool  [Denn78]. 
It  replaces  the  usual  assumptions  of  stationary  stochastic  processes  used  in 
classical  queueing  theory,  with  simple  operational  (measurable)  assumptions 
that  can  be  verified  by  monitoring  the  running  system.  The  basic  assumptions 
are  the  following. 

1) [Measurability] All quantities of interest are precisely measurable.

2)  [Flow  Balance]  During  a  reasonably  long  observation  period,  the  number  of 
arrivals  at  each  service  center  approximately  equals  the  number  of  departures 
(completions)  from  that  service  center. 

3)  [Homogeneity]  The  routing  of  jobs  must  be  independent  of  local  queue 
lengths,  and  the  mean  time  between  service  completions  at  a  given  device  must 
not  depend  on  the  queue  lengths  of  other  devices. 

To  use  this  approach  one  measures  certain  basic  quantities  directly  from 
the  system,  typically  the  following. 

T  =  length  of  observation  period 
A  =  number  of  arrivals  in  time  T 
B  =  total  time  the  system  is  busy  in  time  T 
C  =  number  of  completions  in  time  T 

These  quantities  are  then  used  to  compute  other  quantities  called  derived 
quantities,  that  will  hopefully  give  a  reasonable  characterization  of  the 
average  behavior  of  the  system.  Some  of  the  more  important  derived  quantities 
are  the  following. 

y  =  A/T  =  arrival  rate 
X  =  C/T  =  output  rate 
U  =  B/T  =  system  utilization 
S = B/C = mean service time

Further, there are operational laws and theorems that can be shown to be true
when the system satisfies the basic assumptions. Examples of these are the
following.

A  =  C  :  job  flow  balance 

U  =  yS  :  utilization  limit  theorem 

These numbers can then be used as a guide for tuning or upgrading the system,
and are especially useful in identifying bottlenecks. It turns out that the
equations derived from the operational approach agree with their traditional
Markovian counterparts. This helps explain the robustness of stochastic
queueing network models (they seem to have good accuracy even when their
assumptions are in doubt).
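
The derived quantities are simple ratios of the measured ones, as the following Python sketch shows. The function name and the measurement values are hypothetical; they describe an imagined 60-second observation period and are not taken from any real system.

```python
def derived_quantities(T, A, B, C):
    """Compute the derived operational quantities from the basic measurements.

    T: length of observation period    A: number of arrivals
    B: total busy time                 C: number of completions
    """
    return {
        "arrival_rate": A / T,        # y = A/T
        "output_rate": C / T,         # X = C/T
        "utilization": B / T,         # U = B/T
        "mean_service_time": B / C,   # S = B/C
    }


# Hypothetical measurements: 120 arrivals and 120 completions in 60 seconds,
# with the device busy for 36 of those seconds.
q = derived_quantities(T=60.0, A=120, B=36.0, C=120)
print(q)

# With job flow balance (A = C), the utilization equals y * S.
print(q["arrival_rate"] * q["mean_service_time"])
```

Because A = C in this example, the product of the arrival rate and the mean service time reproduces the measured utilization, which is the relation U = yS given above.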

The advantages of this approach are that it can be applied to any system
that satisfies simple assumptions, calculations involve simple formulas, and
it is easy for practitioners (systems analysts) to apply (one does not need to
learn queueing theory). The disadvantages are that it is only applicable to
existing systems that have good monitoring capabilities, and only average
behaviors are considered (part of the beauty of classical queueing theory is
that it predicts non-intuitive results due to randomness).

F.2.2.3 Numerical Analysis

When the state of the system can be fully specified by the number and
types of customers at the various service centers, a steady state
solution may be obtained by solving the flow balance equations. [Note that to
fully specify the state of a GI/G/1 queue, time must be included in the state
description.] In general, these equations constitute an infinite set of
linear equations. Thus we must exploit a recursiveness in these equations to
obtain a closed-form solution, but this cannot always be done. However, in
many cases, such as closed networks, the number of possible states is finite,
and hence the balance equations form a finite set of linear equations. We may
therefore apply the techniques of linear algebra to obtain a numerical
solution.

Because these equations are usually very sparse, an iterative solution
technique is more efficient than the more common elimination based techniques.
A simple procedure that may be used is called Gauss-Seidel iteration [Coop81].
Suppose one has the following set of linear equations

$$Ax = b.$$

Divide each row of A and b by $a_{ii}$ to get $Bx = d$. Letting $B = I - L - U$, where
L and U are lower and upper triangular respectively, we have

$$(I - L - U)x = d,$$

which may be rewritten

$$x = Lx + Ux + d.$$

Starting with an initial guess $x^0$ we may iterate using the following equation
to converge to the solution,

$$x^{n+1} = Lx^{n+1} + Ux^{n} + d.$$

Note that in practice a more sophisticated version, such as the method of successive
overrelaxation [Coop81], is often used.
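
A minimal Python sketch of Gauss-Seidel iteration follows. It works directly on A and b rather than forming L, U, and d explicitly; updating each component in place, using the newest values of the components already updated in the current sweep, is equivalent to the iteration described above. The function name, the stopping rule, and the small test system are assumptions made for illustration. The example is a generic diagonally dominant system, for which convergence is guaranteed, and not a set of balance equations, which would also need the normalization condition that the probabilities sum to one.

```python
def gauss_seidel(A, b, x0=None, tol=1e-10, max_iter=10_000):
    """Solve Ax = b by Gauss-Seidel iteration (A must have a nonzero diagonal)."""
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(n):
            # Components j < i already hold new values, components j > i old ones.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - s) / A[i][i]
            max_change = max(max_change, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_change < tol:
            return x
    raise RuntimeError("Gauss-Seidel did not converge")


# Hypothetical system: 4x + y = 6 and x + 3y = 7, whose solution is x = 1, y = 2.
solution = gauss_seidel([[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0])
print(solution)
```

For a sparse matrix the inner sum would be restricted to the nonzero entries of each row, which is where the efficiency advantage over elimination comes from.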


F.2.2.4  Approximate  Analysis 

Because  real  computing  systems  can  be  quite  complex,  the  models  of  them 
need  to  be  highly  flexible.  Typically,  when  a  system  is  thought  to  be  too 
complex  to  be  solved  by  exact  closed  form  or  numerical  methods,  simulation  is 
resorted to. This however need not be the case. The use of approximate solution
techniques provides a way to obtain answers of reasonable accuracy to
very  general  queueing  network  models.  The  word  reasonable  is  used  rather 
loosely;  one  of  the  difficulties  with  approximation  techniques  is  estimating 
their  error  bounds. 


Before presenting these techniques, let us first consider some complications
that make the previous techniques intractable, but have been solved
by approximation techniques [Chan78].

1) [Distributions and Disciplines] If the arrival distribution, service
distribution,  and  queueing  discipline  do  not  satisfy  the  assumptions  for  BCMP 
networks,  then  it  is  likely  that  an  approximation  technique  will  be  needed.  A 
good  example  of  this  is  a  network  with  priority  disciplines. 

2) [Multiple Resource Holding] When a customer (job) needs more than one
resource simultaneously to obtain service, an approximation will be needed.
An example of this is a passive resource, a resource that does not have a service
time associated with it, but limits the population of jobs that may
utilize other devices.

3) [Blocking] In networks where finiteness of the queues is critical, such as
a packet switching network, a device (server) may be blocked, i.e., prevented
from serving jobs in its queue because a queue elsewhere in the network is
full and cannot accept any more jobs. Again an approximation technique will
be needed.

4)  [Scheduler]  When  the  delays  due  to  waiting  for  a  scheduler  to  be  activated 
become  significant,  approximations  will  be  necessary.  Schedulers  are  a 
particular  complication  because  once  activated  they  serve  many  jobs  in  a 
relatively  short  time,  so  that  it  is  hard  to  model  the  service  time  of  a 
scheduler. 

5) [Parallelism] If a system has tasks whose subtasks can be run in parallel,
then approximation is again called for. An example of this is CPU:I/O overlapped
processing, where the CPU and an I/O device service a job in parallel.

6) [Routing] If the probability that a job completing service at device i goes
to device j is not a constant $r_{ij}$ but depends on the state of the system,
then an approximation will usually be needed. An important example of
dynamic routing is load balancing (e.g., in a pooled computer system
the scheduler would send a newly arriving job to the computer with the least
expected delay).

We will now look at two types of approximations that have been used successfully,
decomposition and diffusion. Decomposition approximations solve
queueing network problems by breaking the network into pieces, solving these
pieces separately, and finally aggregating these subsolutions to obtain a
solution to the whole model [Chan78]. The justification for the accuracy of
this approach is, first, that its application to networks with product form
solutions yields exact results, and second, that if one partitions the network
into loosely coupled subnetworks then the approximation will likely be good,
since the interaction effects between the subnetworks will be minimized. The
simplest decomposition approach is called the flow equivalent method. Here
the strategy is to partition the network into loosely coupled subnetworks,
replace each subnetwork with a flow equivalent composite queue, solve each
subnetwork to determine the behavior of its associated composite queue, and
finally solve the new aggregate network composed of the composite queues.
Note that the partitioning may need to be applied recursively to some subnetworks
to achieve a tractable solution (i.e., continue breaking up the
network into smaller pieces until the pieces are small enough to be solved by
some other technique, ideally exact analysis).

Diffusion approximations can be used to obtain approximate solutions to
queues with general arrival and service time distributions (e.g., a GI/G/1
queue) [Klei76, Chan78]. The time dependent behavior of a queue is specified
by $p(t,n;n_0)$, the probability that at time t there are n customers in the
queue given that there were $n_0$ customers at time t = 0. $p(t,n;n_0)$ may be
found by solving a set of differential equations (one for each value of n).
However, for general distributions this cannot be done; hence an approximation
is needed. The idea is to replace the discrete variable n by the continuous
variable x > 0, where the correspondence between n and x is n = [x].

Making the substitution of x for n and the density function $f(t,x;x_0)$
for $p(t,n;n_0)$ and taking the Taylor expansion to second order of the
differential equations, one obtains a partial differential equation, the
diffusion equation

$$f_t(t,x;x_0) = -c\,f_x(t,x;x_0) + \tfrac{1}{2} D^2 f_{xx}(t,x;x_0), \qquad x, t > 0,$$

where c and D are functions of the means and variances of the arrival and
service distributions [Heym82]. This equation models Brownian motion where a group of
particles is released at $x_0$ and diffuses outward because of collisions, subject
to the constraint of a reflecting boundary at x = 0.

To obtain a solution to this equation we will use the distribution function,
$F(t,x;x_0)$ (the integral with respect to x of $f(t,x;x_0)$), rather than the density
function. It can be shown that F also satisfies the diffusion equation

$$F_t(t,x;x_0) = -c\,F_x(t,x;x_0) + \tfrac{1}{2} D^2 F_{xx}(t,x;x_0)$$

and has the following initial and boundary conditions

$$F(0,x;x_0) = \begin{cases} 0 & \text{if } x < x_0 \\ 1 & \text{if } x \ge x_0 \end{cases} \qquad F(t,0;x_0) = 0, \quad t > 0.$$

The solution is given by [Heym82]:

$$F(t,x;x_0) = \Phi\!\left(\frac{x - x_0 - ct}{D\sqrt{t}}\right) - \exp(2cx/D^2)\, \Phi\!\left(\frac{-x - x_0 - ct}{D\sqrt{t}}\right)$$

where $\Phi$ is the normal distribution function. Finally, the steady state
solution is found by taking the following limit

$$F(x) = \lim_{t \to \infty} F(t,x;x_0) = 1 - \exp(2cx/D^2)$$

where  c  <  0.  Under  heavy  traffic  conditions  this  approximation  has  been  shown 
to  be  good  by  both  empirical  evidence  and  a  theorem  due  to  Iglehart  and  Whitt. 
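
The transient and steady state distribution functions of the diffusion approximation can be evaluated numerically with a few lines of Python. The sketch below takes the drift c and the coefficient D as given inputs, since the text does not state how they are computed from the arrival and service distributions; the function names and the parameter values are hypothetical. It assumes c < 0, D > 0, and t > 0.

```python
from math import erf, exp, sqrt


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def diffusion_cdf(t, x, x0, c, D):
    """Transient distribution F(t, x; x0) with a reflecting boundary at x = 0."""
    scale = D * sqrt(t)
    first = normal_cdf((x - x0 - c * t) / scale)
    second = exp(2.0 * c * x / D ** 2) * normal_cdf((-x - x0 - c * t) / scale)
    return first - second


def steady_state_cdf(x, c, D):
    """Steady state limit F(x) = 1 - exp(2cx / D^2), valid for c < 0."""
    return 1.0 - exp(2.0 * c * x / D ** 2)


# Hypothetical parameters: drift c = -0.5, D = 1, initial queue length x0 = 2.
for t in (1.0, 10.0, 100.0, 10_000.0):
    print(t, diffusion_cdf(t, 1.0, 2.0, -0.5, 1.0))
print("limit", steady_state_cdf(1.0, -0.5, 1.0))
```

Printing the transient values for increasing t shows them approaching the steady state value, and the boundary condition at x = 0 holds because the two terms cancel there.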

In using the diffusion approximation for a network of queues, one
generally assumes a product form solution and analyzes each queue independently
[Chan78]. The diffusion approximation can also be useful in conjunction
with the flow equivalent method when individual queues have general
arrival and service distributions. Diffusion approximations are also being
applied directly to special networks. For example, Foschini uses the
diffusion approximation to solve routing problems for a system with parallel
queues [Fosc77].

F.2.2.5 Simulation

Finally, if a model does not yield to any of the previous techniques,
then one can simulate the system to get sample solutions which can be
statistically analyzed to determine the characteristics of the system.
However, one should not go about simulation in a haphazard manner; such
simulations can provide unreliable results. For queueing network models,
regenerative simulations have been shown to give accurate results [Chan78,
Igle78, Saue79a]. In this method confidence intervals for, say, mean response
time are periodically estimated, and a sequential stopping rule is applied to
determine the run length (these problems are difficult for arbitrary
simulations). Simulations are also useful in conjunction with analytic techniques
(hybrid approach). For example, when using the decomposition (also
called hierarchical) approach, it may be computationally prudent to obtain
numerical solutions to the submodels, and then use simulation for the
aggregate model. Simulation is somewhat analogous to the goto statement: it
is very powerful, but should only be used in well thought out ways.

F.2.3 Queueing Network Packages

To  make  queueing  network  modeling  more  convenient,  packages  have  been 
developed  to  solve  these  models  [Saue79b].  Generally  these  packages  take  as 
input  a  specification  of  the  queueing  network  (via  an  interactive  dialogue  or 
a  special  purpose  language),  formulate  the  problem  mathematically,  and  solve 
the  equations  using  the  techniques  described  in  this  paper.  Let  us  now 
consider  some  of  the  major  packages  that  have  been  developed  (note,  many  of 
these  packages  are  available  either  commercially  or  otherwise). 

The  first  major  package  to  be  completed  was  RQA  by  Wallace  and  Rosenberg 
in  1966.  RQA  solves  queueing  networks  with  finite  state  spaces  by  formulating 
the  linear  (global)  balance  equations,  and  solving  them  by  numerical  analysis. 
The use of this approach has two principal weaknesses. First, there is a
limit to the number of states that a system can have (a few thousand states).
Second, for systems with a large number of states RQA can be intolerably slow.

In  1973,  ASQ  was  completed  by  Keller.  ASQ  solves  queueing  networks  with 
product  form  solutions  using  exact  analysis.  Later  ASQ  was  extended  to  the 
hierarchical  solution  of  networks,  and  was  eventually  renamed  CADS.  The  chief 
limitation  here  is  that  the  networks  must  have  product  form  solutions. 

Foster  and  McGehearty  completed  in  1974  a  special  purpose  language  QAL 
and  implemented  a  simulation  solution  program  QSIM.  QAL  provided  several 
extensions  to  the  networks  of  RQA  and  ASQ,  such  as  allowing  passive  resources. 
The primary weaknesses of QAL are its lack of non-simulation solution
implementations, and its lack of support for representing distinct job classes.

In 1975, Sauer completed APPLOMB, which solved a general class of queueing
networks using regenerative simulation. During this same year QNET4 was
completed by Reiser. QNET4 solved product form networks with multiple (local)
job  classes  using  exact  analysis.  In  1978,  these  two  packages  were  combined 
to  form  RESQ,  which  provides  a  fairly  comprehensive  solution  capability  and  a 
good  user  interface. 

BEST/1 was completed by BGS Systems in 1977. BEST/1 was specifically designed
to  solve  capacity  planning  problems  in  computer  systems.  It  solves  slight 
variants  of  product  form  networks  using  exact  analysis  in  conjunction  with 
special  approximations  (the  details  are  proprietary). 

The  final  package  we  will  consider  is  QSOLVE,  which  was  completed  by 
Levy in 1977. QSOLVE uses an approach similar to RQA in that it uses numerical
analysis. However, it is oriented toward networks similar to, but violating,
product form (e.g., it allows more general job classes and queueing
disciplines).

F.3  MODELS 

Now that we have a feel for what queueing networks are and how
they are solved, we turn to the modeling process. This process, which is more of
an art than a science, involves a careful examination of the system (real or
hypothetical) and abstracting out the essential features of the system
relevant to the aspect of performance being considered. Modeling studies may
focus on the total system or some subsystem such as the operating system, the
database system, or the communication subsystem. In modeling general computing
systems, two approaches have been successfully used, queueing networks and
simulation. The disadvantages of simulations are their high cost, both computational
and developmental, and their potential unreliability resulting from
programming bugs and the difficulty of applying rigorous statistical analysis
to their outputs [Koba78]. For all but highly detailed models, queueing
networks offer a good alternative. The reason they have not been used more widely
is that many of the advances in solving these models have come about in
the last few years, and as yet not enough experience has been gained in
their use.

An ideal scenario for the use of modeling is the following [Saue81]: In
the  early  design  phases  use  simple  queueing  network  models  to  reject 
infeasible  designs  and  guide  design  improvements.  As  the  design  nears 
finalization,  it  should  be  represented  by  a  detailed  queueing  network  model. 
At  this  point,  if  there  is  sufficient  time  and  money,  a  detailed  simulation 
may  be  helpful.  It  can  capture  some  of  the  details  ignored  in  the  queueing 
network  model,  and  if  they  agree  it  provides  a  partial  validation  of  the 
queueing  network  model.  Note,  a  nice  feature  of  this  approach  is  that  if 
everything  goes  right,  only  one  costly  simulation  will  be  necessary.  Once  the 
system is operational the queueing network model should be validated by comparing
its performance predictions with performance measurements obtained from
the  running  system.  If  there  are  significant  disagreements,  then  the  results 
can  be  used  to  correct  the  deficiencies,  either  in  the  model  or  the  system. 
Once  the  model  is  validated,  it  can  be  used  to  configure  other  installations 
with  greater  confidence,  and  if  changes  in  the  system  are  needed,  it  can  be 
used  in  redesign  and  redevelopment. 

F.3.1  Some  Successful  Models 

To  see  how  queueing  network  models  are  used,  we  will  look  at  several 
modeling  studies.  In  the  rest  of  this  section  we  will  look  at  some  models 
that  have  been  used  successfully,  i.e.,  the  models  were  shown  to  be  accurate 
and  were  found  to  be  of  use  in  designing,  upgrading,  and/or  tuning  computing 
systems.  In  the  next  section,  we  will  focus  on  the  application  of  queueing 
network  models  to  a  new  area  of  high  potential,  where  results  just  recently 
began  coming  in,  namely  distributed  processing  systems. 

The first successful application of a queueing network model to a computing
system was done by Scherr in 1967. He applied a machine repairman
model to the Compatible Time-Sharing System (CTSS) at MIT [Grah78, Munt75].
This model can be thought of as a closed queueing network with two nodes, one
representing the central system (memory, CPU, and I/O devices), and the other
representing a collection of N terminals.


[Figure: closed two-node network in which jobs circulate between the terminals and the central system.]

In this model N jobs circulate around the network; each job is
permanently associated with a particular terminal. At the terminal node there
is no queueing, so that a job goes directly to its associated terminal and
remains there for the duration of its terminal service time (the think time of its
user), which is modeled as an exponential random variable. At the central
system node the jobs queue up to obtain its services. The service time of the
central system represents the sum of the program execution time and the
unoverlapped swap time, and is also modeled as an exponential random variable.
[Note that CTSS was an early interactive system where user programs were swapped
in and out of memory, implying only one program could be in memory at a time.]
Hence there are three possible states a job can be in: 1) at its terminal,
which corresponds to a user thinking, 2) in the central system queue waiting for
service, or 3) receiving service from the central system.
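
Under the exponential assumptions just described, the two-node model has a simple closed-form steady state, and the mean response time follows from it. The Python sketch below is a minimal illustration of that calculation, not a reproduction of Scherr's analysis; the function name and the parameter values (30 seconds of think time, 2 seconds of service) are hypothetical. It uses the standard result that the probability of n jobs at the central system is proportional to N!/(N-n)! times (S/Z)^n, where S is the mean service time and Z the mean think time, and then applies the relation R = N/X - Z between response time R and throughput X.

```python
def repairman_response_time(n_terminals, think_time, service_time):
    """Mean response time of the central system in the two-node closed model."""
    ratio = service_time / think_time
    # weights[n] is proportional to P(n jobs at the central system).
    weights = [1.0]
    for n in range(1, n_terminals + 1):
        weights.append(weights[-1] * (n_terminals - n + 1) * ratio)
    p_idle = weights[0] / sum(weights)
    throughput = (1.0 - p_idle) / service_time  # completions per unit time
    return n_terminals / throughput - think_time


# Hypothetical workload: 30 s mean think time, 2 s mean service time.
for n in (1, 5, 10, 20, 40):
    print(n, round(repairman_response_time(n, 30.0, 2.0), 3))
```

With a single terminal there is never any queueing, so the response time equals the service time; as terminals are added the response time rises slowly at first and then roughly linearly once the central system saturates.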

Clearly this is a very simple model; one might expect it to be too simple to be an
accurate predictor of performance. Scherr compared the model's predicted mean
response time with the actual response time experienced by users.
Surprisingly, the model proved highly accurate. In Scherr's words, the results
were "startling" considering the simple model used to predict the performance
of a "highly complex hardware-software system." This high accuracy is in part
explained by the fact that the central system serves only one job at a time;
hence the model agreed well with the configuration of the real system. In
addition, it has been shown by theoretical results that performance measures
such as average delays and throughputs are rather insensitive to service
distributions [Koba78].

In 1972, this same model was applied by Lassettre and Scherr in the
design of IBM's Time Sharing Option (TSO) with a single partition [Munt75].
The model's predicted mean response times were compared with measured response
times that were generated from script driven workloads. When at first the
model and the measurements did not agree, Lassettre and Scherr had enough confidence
in their model to claim that a system error or poor scheduling policy
was the cause. This indeed turned out to be the case; after the performance
bug was located the model gave accurate predictions [Koba78].

The next major advance in the application of queueing network models to
computing systems came in 1971 when Moore modeled the Michigan Terminal
System  (MTS)  as  a  closed  Jackson  network  [Koba78,  Munt75].  His  model 
explicitly  represented  the  major  resources  of  the  system,  an  IBM  360/67  with  a 
dual  processor,  1.5  megabytes  of  memory,  2  paging  drums,  and  approximately  100 
terminals.  He  found  that  the  model  could  be  simplified  somewhat  by  treating 
lightly  used  resources  as  a  single  resource. 

As  specified  by  Jackson's  results  the  service  times  for  each  resource 
were  modeled  as  exponential  random  variables  whose  means  were  estimated  from 
measurements  of  the  system  in  operation.  Moore  observed  that  the  exponential 
distribution did not fit the data he collected very well; however, for predicting
average values (e.g., mean response time and resource utilizations) this did
not have much effect on the accuracy of the predictions. He also used this
measured  data  to  estimate  the  transition  probabilities  (probability  of  going 
to  resource  j  after  completing  service  at  resource  i). 

Moore  measured  the  system  over  10  to  15  minute  intervals,  using  the  data 
to  estimate  the  model  parameters  specified  above.  He  then  compared  the  model 
predictions  of  mean  response  time  and  resource  utilizations  to  those  obtained 
from the measured data. Again good accuracy was obtained; the predicted
values were typically within 10 percent of the measured values. Considering
the  complexity  of  the  system,  a  large  interactive  computing  system,  these 
results  are  very  good.  In  addition,  Moore  found  that  the  performance  was  very 
sensitive  to  the  load  on  the  system,  so  that  accurate  estimation  of  the  model 
parameters was essential for accurate performance predictions.

Also in 1971, Buzen used a particular type of closed Jackson network to
analyze multiprogramming systems. He called models of this type central
server models [Koba78]. A basic central server model consists of a CPU and
independent secondary storage and I/O devices.

This closed network captures the basic behavior of a multiprogramming system.
The number of jobs that circulate through the system corresponds to the
multiprogramming level. A typical job will progress as follows: it will
receive CPU service, from which it will either be preempted or request I/O
service, upon completion of which it will again seek CPU service. This
scenario will be repeated indefinitely. Clearly a real job does not have an
infinite lifetime, but if a system has a maximum multiprogramming level and is
reasonably loaded, we can think of a completed job being replaced by a new
job. Hence the abstract notion of an infinite job is a reasonable model.
Buzen first used these models to study the throughput of batch systems
[Munt75].
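
A central server model of this kind can be evaluated numerically. The sketch
below uses the convolution algorithm for the normalization constant of a
single-class closed product form network with fixed-rate servers (the
algorithm commonly attributed to Buzen). The service demands and the function
names are illustrative assumptions, not values or code from the studies cited.

```python
def normalization_constants(demands, n_jobs):
    """Return [G(0), ..., G(n_jobs)] for a closed network whose devices
    have the given service demands (total service time per job cycle)."""
    g = [1.0] + [0.0] * n_jobs          # network with no devices yet
    for d in demands:                   # fold in one device at a time
        for n in range(1, n_jobs + 1):
            g[n] += d * g[n - 1]        # G_k(n) = G_{k-1}(n) + d * G_k(n-1)
    return g


def throughput(demands, n_jobs):
    """Job cycles completed per unit time at multiprogramming level n_jobs
    (n_jobs must be at least 1)."""
    g = normalization_constants(demands, n_jobs)
    return g[n_jobs - 1] / g[n_jobs]


def utilizations(demands, n_jobs):
    """Utilization of each device: demand times throughput."""
    x = throughput(demands, n_jobs)
    return [d * x for d in demands]


if __name__ == "__main__":
    # Hypothetical demands in seconds per job cycle: CPU, disk, drum.
    demands = [0.05, 0.03, 0.02]
    for level in range(1, 5):
        print(level, throughput(demands, level), utilizations(demands, level))
```

Raising the multiprogramming level increases throughput with diminishing
returns, bounded by the reciprocal of the largest demand.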

Later,  Buzen  used  a  central  server  model  for  a  comprehensive  analysis  of 
the  IBM  Multiple  Virtual  Storage  (MVS)  operating  system  [Buze78].  The  purpose 
of  this  analysis  was  to  model  the  resource  allocation  mechanisms  of  MVS,  so 
that  given  current  or  future  workloads  for  a  system,  an  optimal  strategy  for 
upgrading  and  tuning  could  be  determined.  MVS  allows  an  installation  manager 
to  classify  workloads  (MVS  allows  batch  workloads,  time  sharing  workloads,  and 
transaction  processing  workloads),  and  provides  mechanisms  by  which  the 
allocation of resources to workloads can be controlled. Additionally, within
workloads there are mechanisms by which allocations to individual jobs can be
controlled.

For controlling the allocation of resources among workloads, MVS
provides two mechanisms: one regulates access to memory, and the other
regulates access to the central processor. To control the allocation of
memory, the installation manager divides it into domains (logical regions
of memory) and assigns workloads to these domains. Since each domain has a
maximum allowed multiprogramming level, this domain mechanism regulates the
allocation of jobs to memory according to their job class (workload
classification). The allocation of the central processor is controlled by the
scheduling algorithm. Scheduling among workloads is done using a
preemptive-resume priority discipline, where jobs of higher priority preempt
jobs of lower priority, which are resumed upon completion of the higher
priority job. Thus a workload's allocation is controlled by assigning it an
appropriate priority level. [Note: within a single priority level a
round-robin (or equivalent) scheduling algorithm is used.]

Given this first level of resource allocation control, MVS also provides
for a second level of control using mechanisms that allocate resources within
workloads. Here decisions are not made on the basis of job classification,
but rather by the operating system monitoring the behavior of jobs. There are
two mechanisms by which this control is carried out: domain migration and
exchange swapping. Domain migration is used to control the allocation of
memory. The idea here is to associate several domains with each workload and
to set the multiprogramming level lower for each successive domain. Then when
a job has consumed too many service units (a weighted sum of CPU time, I/O
processing, and the memory space-time product), it is transferred to the next
domain. If the job is transferred to a domain already at its target
multiprogramming level, the job will be swapped out of main memory. The second
mechanism, exchange swapping, is also used to control the allocation of
memory. The idea here is that a dynamic memory priority is periodically
computed for all jobs, and when, for a given domain, the priority of a job in
memory falls below that of one waiting to be loaded into memory, an exchange
swap is generated. These mechanisms keep jobs from monopolizing main memory.

The specific system that was modeled was an IBM 370/168 Model 3 with 7
megabytes of memory and a total of 46 disks, drums, and tape drives. The
system was processing a variety of workloads, principally time sharing (TSO),
batch processing, transaction processing (IMS), and certain special purpose
subsystems. In developing the model, Buzen first specified the job classes:
three classes were used for time sharing (short, medium, and long
transactions), one was used to represent batch processing, and a fifth was
used to represent various system overhead processes. Each of the five job
classes was assigned a particular domain.

Next, Buzen modeled the various features of MVS, especially those dealing
with resource allocation. Domain migration was used for time sharing jobs.
Initially, all such jobs would be in the first domain, and as time progressed
some would migrate to the second and third time sharing domains. Hence, to
capture the steady state behavior, domain migration was modeled by assigning
the appropriate fraction of jobs to the three levels. Interactive swapping
(whenever an interactive job is waiting for terminal input and another
interactive job is waiting to be loaded into memory, a swap is generated) was
modeled by assuming that certain I/O devices are allocated to swapping and
setting their mean service time to the average swapping time for that
particular device. For the system that Buzen modeled, both drums and slower
disks were used for swapping, the drums being used until they were full.
Demand paging was similarly modeled by setting the mean service time to the
average paging time for the particular devices used. Note that page reads and
writes were treated differently: since a job must wait for a page to be read,
this activity is considered to be part of the job's demand for I/O services,
whereas page writes are considered to be part of the system overhead. Central
processor scheduling, which used a preemptive-resume priority discipline, was
modeled using certain proprietary approximations (priority scheduling violates
the product form conditions). Finally, exchange swapping was found to be a
negligible factor (it was not used frequently) and was therefore not included
in the model.

The complete queueing network model was formed by connecting these
submodels together. The result was a central server model which was a
generalization of the one shown previously. Buzen used the queueing network
package BEST/1 to analyze this model. BEST/1 is specifically designed to
analyze central server models of this form, and allows the following features
to be modeled:

1) [Job Classes] Multiple job classes can be represented, where each class has
its own service time requirements at each device. Job classes can be
either open or closed.

2) [Domains] Multiple domains can be represented, where each domain has its own
target multiprogramming level and a separate queue. One or more job classes
can be assigned to a domain.

3) [Disciplines] All product form queueing disciplines are allowed. In
addition, the CPU queueing discipline can be preemptive-resume priority.

4) [I/O Devices] I/O devices each have their own service times (which include
channel and controller delays).

Buzen collected data from the system using the IBM Resource Measurement
Facility (RMF) to obtain estimates for the model's parameters. He then
compared the model's predictions of mean response time (broken down by job
class), device utilizations, and total throughput with those measured from the
running system. Typically, the model's predictions were off by less than 10
percent, and in many cases the predictions were very accurate.

To help complete the picture without belaboring the point, let us
briefly consider some further applications. Queueing network models have been
used to study the performance of multiprocessing systems. Sauer and Chandy
modeled general multiprocessing systems (tightly coupled systems such as
C.mmp) to analyze the performance characteristics of such systems [Saue79a].
Specifically, they considered the effects on performance of CPU service
distributions and disciplines, the level of multiprogramming, multitasking,
and job priorities. In their analyses they compared the performance of a
uniprocessor with unit speed to that of a multiprocessor having N processors,
each with speed 1/N (for N = 2, 4, 8). Considering how cheap microprocessors
are, one would think the multiprocessor would be far less expensive (compare 8
Intel 8086's to an IBM mainframe). Sauer and Chandy included in their model a
performance reduction factor based on the work of Fuller (he found that the
degradation in performance caused by the contention of processors for memory
was less than 10 percent for actual and proposed C.mmp configurations).
Basically, Sauer and Chandy found that, given a sufficiently high
multiprogramming level, the multiprocessing systems could, even using a simple
scheduling strategy (FCFS), obtain system throughputs close to those obtained
by the uniprocessor system (in the range of 70 to 100 percent).

Finally, an interesting application of queueing network models was carried out
by Browne, Chandy, and four other consultants in the development of the Air
Force's Advanced Logistics System (a large data management system) [Saue81].
The queueing network model was composed of four submodels: one for the CPU's
(2 Cyber 70's), one for the memories, both private and shared (million words),
one for the database disks (100 disks), and one for the system/scratch disks
and tape drives (8 disks and 24 tape drives). The model predicted that the
proposed  system  was  inadequate  because  of  insufficient  capacity  in  the 
system/scratch  disk  subsystem  and  in  the  CPU's.  Both  of  these  predictions 
were  confirmed  by  subsequent  operational  experience  and  measurement. 
Amazingly,  the  entire  modeling  effort  required  only  two  months  for  the  six 
consultants  to  complete. 

F.3.2  Application to Distributed Processing Systems

Currently distributed processing systems are generating much research
interest, and rightly so. They potentially provide for high system
availability, reliability, and performance, and for incremental growth and
configuration flexibility [Ensl78]. This flexibility provides for many degrees
of freedom in the design process. Because of this, modeling of the
performance of distributed processing systems becomes very important. Within
the framework of the ISO Open System Interconnection Architecture several
design decisions need to be made. Many of these decisions will have a
significant impact on the performance of a distributed system [Tane81].


Basically, queueing network models are used in two types of studies of
distributed processing systems [Wong78]. The first type of study is directed
at the communication subnetwork, while the second is directed at the
user-resource network.

Performance  studies  of  the  communication  subnetwork  are  concerned  with 
the  delivery  of  messages.  Performance  measures  of  importance  here  are  message 
end-to-end  delay,  message  throughput,  and  channel  utilizations.  Three  design 
areas are involved in these studies. First, the system configuration
(assuming a given topology) may be modeled to answer questions such as what capacity
channels  to  use  and  how  many  message  buffers  to  provide.  Second,  the  basic 
control  algorithms  of  the  network  layer  such  as  routing,  congestion  control, 
and  access  protocols  can  be  modeled  and  analyzed  to  determine  the  best  network 
control  strategy.  Finally,  at  the  transport  layer,  flow  control  and  virtual 
circuit  path  selection  can  be  modeled  to  address  end-to-end  concerns. 

As an example of such a study, let us consider a model of a message
switched network (note: packets can be regarded as small messages) [Wong78].
The type of communication subnetwork considered in this study consists of
several intermediate processors (IMP's) connected by communication channels
(C's). The intermediate processors are responsible for the usual store and
forward operations of message buffering and outgoing channel selection, while
the channels are responsible for transmitting messages from IMP to IMP.

The subnetwork is modeled as an open BCMP queueing network, where
messages originating from user terminals and host computers move from source to
destination by successively queueing for service at the two types of nodes.
One type of node represents the intermediate processors, while the other
represents the channels. For example, consider a piece of the queueing
network consisting of three intermediate processors (one having external
arrivals), and the channels connecting them.

As a first order approximation, it is assumed that the queueing delays
and service times at the intermediate processors are negligible. This is done
in Wong's model (based on Kleinrock's work [Klei76]) by letting the IMP nodes
have no queueing and zero service time (essentially these nodes carry out
instantaneous routing) [Wong78]. The service time at each of the channels in
this model is given by the message length divided by the channel capacity. In
addition, the messages are classified according to their source and
destination IMP's. The routing of the messages at the IMP's is done on the
basis of their class and can be either random or fixed. For the sake of
tractability some further assumptions are needed: the external arrivals are
Poisson, all message classes have the same exponential message length
distribution, the queues are unbounded, the discipline is FIFO, and
Kleinrock's independence assumption holds (an approximation stating that each
time a message enters a node its length is redrawn from the exponential
distribution).

Kleinrock solved this model for mean end-to-end delay, and plotted
end-to-end delay versus throughput. He found that the delay was small until the
system was operated at near full capacity (i.e., one or more channels near
saturation), at which point the delay increased rapidly. Wong extended
Kleinrock's solution by solving for the probability distribution of end-to-end
delay for each message class (this allowed variances and percentiles to be
computed). In a model validation study Kleinrock extended this model to
include the processing time of the IMP's, propagation delays, and other
features pertinent to the ARPA network. He found that the mean delay
calculated from the model was 73 msec, while that derived from the measurement
data was 93 msec. This is a discrepancy of 21.5 percent, not unreasonable
considering the complexity of the ARPA network.
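
Under the assumptions of the basic model (instantaneous IMP's, Poisson
arrivals, exponential message lengths, and the independence assumption), each
channel behaves as an M/M/1 queue, and the mean end-to-end delay is the
flow-weighted sum of the channel delays: T = sum over channels i of
(lambda_i / gamma) / (mu C_i - lambda_i), where lambda_i is the message flow on
channel i, C_i its capacity, 1/mu the mean message length, and gamma the total
external arrival rate. The Python sketch below evaluates this formula; the
function name and the numbers in the example are illustrative assumptions, and
the extensions used in the ARPA validation study are not included.

```python
def mean_end_to_end_delay(flows, capacities, mean_length, gamma):
    """Mean message delay in a network of independent M/M/1 channels.

    flows       -- message rate on each channel (messages/sec)
    capacities  -- capacity of each channel (bits/sec)
    mean_length -- mean message length (bits), i.e. 1/mu
    gamma       -- total external arrival rate (messages/sec)
    """
    total = 0.0
    for lam, cap in zip(flows, capacities):
        service_rate = cap / mean_length        # mu * C_i, messages/sec
        if lam >= service_rate:
            raise ValueError("channel is saturated; delay is unbounded")
        # M/M/1 delay on this channel, weighted by its share of traffic.
        total += (lam / gamma) / (service_rate - lam)
    return total


if __name__ == "__main__":
    # One hypothetical 50 kbit/sec channel carrying 1000-bit messages,
    # so the channel can serve 50 messages/sec.
    for rate in (5.0, 25.0, 45.0, 49.0, 49.9):
        print(rate, mean_end_to_end_delay([rate], [50_000.0], 1000.0, rate))
```

The printed delays stay small at light load and grow without bound as the
flow approaches the channel's service rate, which is the behavior described
above.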

Many extensions to this basic model have appeared in the literature.
Wong extended the model to consider the problem of buffer management using a
finite buffer model [Wong78]. Wong has also modeled end-to-end flow control.
Samari and Schneider extended Kleinrock's model by considering delays and
service times associated with the IMP's, and by including a correction factor
to account for the nonexponential nature of the interarrival time of input to
the channels [Sama80]. They tested their model against a simulation model and
found that they differed by less than 7 percent (note that the analytic model
required far less computer time). Kurinckx and Pujolle applied a similar model
(where the nodes were IMP's) to study end-to-end control through virtual
circuits in a computer network built following the X.25 Recommendations
[Kuri80]. They were particularly interested in determining the maximum buffer
overallocation for a given probability of overflow.

Turning now to studies of the user-resource network, we are concerned
with the performance of higher level services. This corresponds to the
application layer in the Open System Architecture. Some of the problems here
are  concerned  with  which  processes  to  run  where,  and  where  to  place  data. 
Finding optimal (or near optimal) solutions to these problems can greatly help
system performance. The implementation of these solutions would be in the
system's  distributed  operating  system  and/or  distributed  database  system. 

As an example, consider the problem of scheduling a set of processes on a
fully distributed processing system. An ideal system level scheduler should
have knowledge of the communication needs of the processes in the system and
the status of the processors in the system. A particular concern would be
that of efficiently scheduling distributed programs such as a distributed
compiler [Mill82], so as to minimize communication waiting delays.
Currently, scheduling as complex as this has not been modeled. However, Chou
and Abraham modeled system level scheduling stochastically [Chou82]. They
considered the problem of scheduling a set of processes (or tasks) on a set of
heterogeneous processors. They presented an algorithm, based on Markov
decision theory, that optimally assigns tasks to processors.

Finally, in an attempt to determine the overall performance of a
distributed processing system, Wong combined his communication subnetwork
model with a model of a simple user-resource component consisting of a set of
remote terminals and a single host with local terminals [Wong78]. Further
assumptions for this model are that the CPU discipline is processor sharing
and that the think time and CPU service time have rational Laplace transforms.
Since this is a BCMP queueing network, Wong found an exact solution for the
mean response time for local and remote users. He plotted the mean response
times versus the number of local users for various numbers of remote users. As
expected, beyond a certain threshold the mean response time increases linearly
with the number of local users.
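
The linear growth beyond a threshold can be reproduced with a much simpler
model than Wong's. The sketch below applies exact mean value analysis, a
technique substituted here purely for illustration (it is not the solution
method described in [Wong78]), to a single-class closed network of terminals
with a think time and one or more queueing devices. The function name and the
parameter values are assumptions.

```python
def mva_response_time(demands, think_time, n_users):
    """Mean response time seen by n_users terminal users.

    demands    -- service demand at each queueing device (sec per interaction)
    think_time -- mean think time at the terminals (sec)
    """
    queue = [0.0] * len(demands)        # mean queue lengths with 0 users
    response = 0.0
    for n in range(1, n_users + 1):
        # An arriving job sees the queue left by a network with n-1 users.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        response = sum(residence)
        x = n / (think_time + response)         # system throughput
        queue = [x * r for r in residence]      # Little's law per device
    return response


if __name__ == "__main__":
    # Hypothetical host: 0.5 sec of CPU per interaction, 5 sec think time.
    for users in (1, 5, 10, 11, 15, 20, 40):
        print(users, mva_response_time([0.5], 5.0, users))
```

With these assumed values the host saturates at about (0.5 + 5) / 0.5 = 11
users; past that point each additional user adds roughly 0.5 sec (the largest
service demand) to the mean response time.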

Other than the studies of the performance of the communications
subnetwork, there have been few analytic models of distributed processing
systems. In particular, more work needs to be done in modeling the higher
level services such as system level scheduling, and in developing integrated
models that take into account both the characteristics of the communication
subnetwork and the characteristics of the various resources connected to the
subnetwork (e.g., processors along with their memories, terminals, secondary
storage devices, and other peripherals).

APPENDIX  G

THE  DESIGN  AND  EVALUATION  OF  A  DISTRIBUTED  COMPILER 


John  A.  Miller 
Richard  J.  LeBlanc 

G.1  INTRODUCTION

The  increasing  availability  of  distributed  computing  systems  connected 
by  local  area  networks  has  produced  interest  in  the  application  of  distributed 
computing  to  software  traditionally  run  on  uniprocessors.  The  principal 
motivation for such application is to attempt to decrease the response time of
programs by partitioning them into components which can be executed in
parallel. This paper describes an experiment which tested the feasibility of
implementing a compiler as a distributed program. It should be noted that
this  study  is  intended  only  as  an  initial  examination  of  this  problem.  The 
results  are  somewhat  dependent  on  the  hardware  configuration  on  which  the 
study  was  conducted  and  on  the  nature  of  the  task  performed  by  a  compiler. 
However,  some  generalized  conclusions  can  be  drawn  from  our  experience. 

To  carry  out  this  experiment,  we  constructed  two  versions  of  a  compiler, 
a  distributed  version  and  a  single-pass  version.  We  then  compared  the 
response  times  of  the  two  compilers  on  test  programs  of  various  sizes.  It  was 
our hope that the distributed version would show a significant improvement in
response time due to its utilization of parallelism inherent in the
compilation process.

The experiment was performed using the facilities of the computing
laboratory of the School of Information and Computer Science at Georgia Tech.
The distributed system available included a network of five Prime computers:
two Prime P550's and three Prime P400's. The computers are interconnected by
Ringnet [Gord79], a packet switched communication system. Ringnet is the subset
of PRIMENET (PRIMENET refers to all of Prime's networking products) that deals
with local area communication. Ringnet is a unidirectional loop network that
consists of a node controller, a coaxial transmission cable that provides an 8
Mbits/sec effective bit rate, and interfaces to the transmission cable at each
node.

A group at the Illinois Institute of Technology has done considerable
research related to this work. They first implemented a distributed compiler
for the language DYNAMO on their network computer, known as TECHNEC [Huen77].
TECHNEC is a network of several LSI-11 computers that work together to execute
a single job at a time. More recently they have reported work on a
distributed Pascal compiler for TECHNEC [El-D79]. Their work has included
considerations of automatic partitioning of object code as well as attempts to
distribute the execution of a compiler. Our work is related to the latter of
these efforts. This report goes beyond their publications by presenting a
comparison of the performance of distributed and standard compilers.

G.2  THE  COMPILER 
G.2.1  Language 

The  programming  language  used  for  this  study  was  a  subset  of  Pascal, 
called  Jigsaw,  used  in  compiler  writing  courses  at  Georgia  Tech.  This 
language was chosen because it is simple enough to keep the compiler
development time as short as possible, yet it contains enough features to present
"typical"  compilation  problems.  The  features  of  Jigsaw  include:  if  and  while 
control  statements,  integer,  real,  array,  and  record  data  types,  and 
parameterized  procedures. 

G.2.2  Components of the Compiler

The process of partitioning the problem is of paramount importance in
implementing a distributed program. Ideally, the component parts of a
distributed program should each implement a single step of the task being
performed. More importantly, the interaction between the components should be
simple and infrequent (relatively speaking). Such a partitioning will result
in components which can easily be connected as a pipeline. The Jigsaw
compiler was partitioned into the following components (all written in
Pascal): a lexical analyzer, a syntactic analyzer, and a semantic
analyzer/code generator. A separate code generator would usually be needed,
but the target code (for a hypothetical machine) was sufficiently similar to
intermediate code (e.g., quadruples) to make a separate code generation step
unnecessary. These components work together in series to decompose source
statements, analyze their contents and, finally, compose the target or object
code. The process is analogous to an assembly line, where the product goes
through different phases on its way to completion.

G.2.2.1  Lexical  Analyzer 

The lexical analyzer or scanner is the first phase of the compilation
process. Its function is to read lines from the source code file and break
them up into their component symbols, called tokens. The tokens of a
programming language are analogous to the words and punctuation marks of
English text. The extraction of tokens from the input line is accomplished by
using a finite state machine that does pattern matching on the characters in
the line. When it finds a valid pattern it outputs a number that specifies the
class to which the token belongs (e.g., identifiers make up a class), and,
where additional semantic information is necessary, it outputs the token
string itself. For example, when the token '123' is found, the lexical
analyzer outputs the class number for integer constants and the string '123',
so that the constant value may be computed later. Thus the lexical analyzer
performs the mapping shown in Figure 1(a).
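
The mapping from an input line to class numbers and token strings can be
sketched in a few lines of Python. This is only an illustration: it relies on
Python's regular expression engine rather than the table-driven finite state
machine used in the Jigsaw compiler, and the class numbers, token set, and
function name are all hypothetical.

```python
import re

# Hypothetical class numbers; the real Jigsaw numbering is not given.
CLASS = {"IDENT": 1, "INT": 2, "ASSIGN": 3, "PLUS": 4, "SEMI": 5}
# Classes whose token string carries semantic information.
CARRIES_STRING = {"IDENT", "INT"}

TOKEN_RE = re.compile(r"""
    (?P<IDENT>[A-Za-z][A-Za-z0-9]*) |
    (?P<INT>\d+)                    |
    (?P<ASSIGN>:=)                  |
    (?P<PLUS>\+)                    |
    (?P<SEMI>;)                     |
    (?P<SKIP>\s+)
""", re.VERBOSE)


def scan(line):
    """Return a list of (class number, token string or None) pairs."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise ValueError(f"illegal character {line[pos]!r} at column {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind == "SKIP":              # white space separates tokens only
            continue
        text = match.group() if kind in CARRIES_STRING else None
        tokens.append((CLASS[kind], text))
    return tokens


if __name__ == "__main__":
    print(scan("count := count + 123;"))
```

Only identifiers and integer constants are accompanied by their strings; for
the other classes the class number alone is enough.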

The  lexical  analysis  component  was  constructed  using  a  lexical  analyzer 
generator, a relatively common compiler development tool available on our
computing system. This tool consists of a table-driven lexical analyzer and a
lexical  table  generator.  To  use  this  tool,  one  specifies  the  tokens  of  the 
language  as  regular  expressions.  The  generator  program  reads  this  information 
and  produces  a  table  that  is  used  by  the  lexical  analyzer  to  make  decisions  in 
the  pattern  matching  process.  This  standard  lexical  analyzer,  which  uses  the 
generated  table,  is  easily  incorporated  in  a  compiler. 

G.2.2.2  Syntactic  Analyzer 

The second phase of the compilation process is performed by the
syntactic analyzer or parser. It takes the token numbers generated by the
lexical analyzer and collects them together to form phrases that are specified
by grammatical rules. This grouping of tokens into phrases is accomplished by a
pushdown-store machine. When it determines that a string of tokens satisfies
a grammatical rule, it replaces that string with a nonterminal symbol that
stands for a string of that type. In addition, for certain rules, actions
must be carried out to manipulate semantic information or generate some form
of code. These actions are specified by "action numbers" attached to these
rules in the grammatical specification. Thus the syntactic analyzer performs
the mapping illustrated by Figure 1(b).

We  made  use  of  another  common  tool,  a  LALR(1)  parser  generator,  to 
construct  the  syntactic  analysis  component.  It  consists  of  a  table-driven 
parser  and  a  parsing  table  generator.  To  use  the  generator,  one  simply  writes 
a  specification  of  the  grammar  for  the  language  in  Backus-Naur  Form  (BNF). 
The  generator  will  produce  tables  from  this  information  which  will  be  used  by 
the  parser  to  make  parsing  decisions. 

G.2.2.3  Semantic  Analyzer 

In  this  implementation,  the  semantic  analyzer  constitutes  the  third  and 
final  phase  of  the  compilation  process.  Its  basic  function  is  to  implement 
the  "semantics"  (meaning)  of  the  program.  It  is  driven  by  the  action  numbers 
generated  by  the  parser.  These  semantic  action  numbers  have  routines 
associated  with  them  that  manipulate  information  on  a  semantic  stack  and/or 
generate  some  form  of  code  (which  in  this  compiler  is  the  target  code). 

Some  of  the  semantic  actions  require  information  from  the  lexical 
analyzer.  For  example,  one  semantic  action  specifies  that  an  identifier 
should  be  pushed  onto  the  semantic  stack.  The  identifier,  in  the  form  of  a 
token  string,  is  obtained  directly  from  the  lexical  analyzer.  Thus  the 
semantic  analyzer’s  task  is  described  by  the  diagram  in  Figure  1(c). 

No  tools  were  available  to  automatically  construct  this  component.  Thus 
the  semantic  analyzer  was  entirely  hand-coded.  It  consists  of  action  routines 
to  carry  out  semantic  actions,  symbol  table  routines  to  store  attributes  of 
identifiers  and  code  generation  routines  to  output  the  simple  target  code. 

G.2.3  The Distributed Compiler

Having defined these basic components, we next consider the task of
putting them together to form a complete compiler. Because of the high degree of
modularity  and  the  simple  interfaces,  an  obvious  way  to  have  the  components 
work  together  is  to  let  them  be  separate  processes  that  communicate  with  each 
other  by  sending  messages.  This  approach  was  adopted  to  form  a  distributed 
Jigsaw  compiler.  The  communication  links  for  this  compiler  are  those  that  are 
formed  by  fitting  the  previous  functional  mapping  diagrams  together.  Figure 
1(d)  illustrates  the  final  structure. 

Notice that the components fit together in a pipeline fashion, where the
input goes through successive transformations on its way to the finished
product, namely the target code. Thus the only interdependency is that
processes must be fed information from their predecessors. This enables the
processes to be implemented efficiently as communicating distributed
processes, where each process runs on a separate computer in the aforementioned
local area network.
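
The pipeline structure described above can be sketched in a few lines of
Python. This is an illustration only, not the Jigsaw compiler: threads and
in-memory queues on one machine stand in for processes on separate computers,
and the token numbers (1 for an identifier, 2 for anything else), the action
name PUSH_ID, the emitted "PUSH" instruction, and the DONE marker are all
hypothetical. The string queue is left unbounded so that the sketch itself
cannot block.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def lexer(source, numbers_out, strings_out):
    """Split the source into words; send token numbers and token strings."""
    for word in source.split():
        if word.isidentifier():
            numbers_out.put(1)      # hypothetical token number: identifier
            strings_out.put(word)   # identifier text goes to the semantic stage
        else:
            numbers_out.put(2)      # hypothetical token number: anything else
    numbers_out.put(DONE)


def parser(numbers_in, actions_out):
    """Turn token numbers into semantic action numbers."""
    while (number := numbers_in.get()) is not DONE:
        if number == 1:
            actions_out.put("PUSH_ID")
    actions_out.put(DONE)


def semantic(actions_in, strings_in, target):
    """Carry out each action, fetching a token string when one is needed."""
    while (action := actions_in.get()) is not DONE:
        if action == "PUSH_ID":
            target.append("PUSH " + strings_in.get())


def compile_pipeline(source):
    numbers = queue.Queue(maxsize=2)   # bounded, like a finite message queue
    actions = queue.Queue(maxsize=2)
    strings = queue.Queue()            # unbounded so this sketch cannot block
    target = []
    stages = [
        threading.Thread(target=lexer, args=(source, numbers, strings)),
        threading.Thread(target=parser, args=(numbers, actions)),
        threading.Thread(target=semantic, args=(actions, strings, target)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return target
```

Each stage depends only on input from its predecessors, so the three stages
can overlap in time whenever their queues are neither empty nor full.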

Ideally, we would like a system where each component could continue running
as long as it had input to process. For instance, the lexical analyzer
could  run  continuously  and  keep  sending  out  token  numbers  and  strings  until  it 
encountered  the  end  of  the  input  file.  Its  output  would  be  accumulated  in  the 
message  queues  at  the  syntactic  and  semantic  analyzers.  However,  since  the 
message  queues  are  finite,  if  the  lexical  analyzer  runs  faster  than  the  other 
two  components  it  will  eventually  be  forced  to  wait  for  them,  thus  destroying 
the  valuable  inherent  parallelism.  Therefore,  steps  need  to  be  taken  to  tune 
or  optimize  the  performance  of  these  cooperating  processes. 

There are three basic factors to be taken into consideration. First,
the speeds of the components need to be balanced. Assume that the total amounts
of time required by the components are $t_1$, $t_2$, and $t_3$ respectively,
where $T = t_1 + t_2 + t_3$ equals the total compilation time for a serial
implementation. Then the maximum possible speed-up factor would be

$$f_s = \frac{T}{\max\{t_1, t_2, t_3\}}$$

Clearly, the best we can do is have $t_1 = t_2 = t_3$, in which case $f_s = 3$;
that is, the parallel version would potentially be 3 times faster than the
serial version. Notice that this factor needs to be considered in the initial
partitioning  of  the  problem.  A  compromise  may  be  required  between  the  goal  of 
balancing  and  that  of  conceptual  separability  or  modularity.  In  the  case  of 
our  compiler,  we  discovered  that  there  was  little  conflict  between  these 
goals. 
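
As a small worked illustration of the balancing argument, the bound on the
speed-up factor can be computed directly from the stage times. The function
name and the sample times below are hypothetical; only the formula (total
serial time divided by the largest stage time) comes from the text.

```python
def max_speedup(stage_times):
    """Upper bound on pipeline speed-up: total serial time / slowest stage."""
    return sum(stage_times) / max(stage_times)


balanced = max_speedup([10, 10, 10])     # three equal stages
unbalanced = max_speedup([20, 50, 30])   # one stage dominates
print(balanced, unbalanced)              # 3.0 2.0
```

With three stages the bound can never exceed 3, and any imbalance pulls it
toward 1.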

The  second  factor  to  be  considered  is  the  size  of  what  we  called  the 
"intervals  of  independence".  These  intervals  refer  to  the  amount  of  time  the 
component processes can run independently (that is, without sending or
receiving messages). They are important because communication and waiting
delays are avoided during these intervals. This argues for making the
intervals large. However, making the intervals too large makes the time to
fill the pipeline significant, and may result in components having to wait too
long for their input. Basically, we are trying to optimize between two separate
types of serialization. The first type of serialization is illustrated by the
time diagram for a single-pass compiler in Figure 2(a), where dt1, dt2, and dt3
are the single-step processing times for the lexical, syntactic, and semantic
analyzers, respectively. The second type is illustrated by a hypothetical
multi-pass compiler that communicates using memory rather than secondary
storage. The time diagram for it would look like the one in Figure 2(b).

The  ideal  design  of  a  distributed  compiler  would  result  in  overlapping 
execution  of  the  three  components,  as  diagrammed  in  Figure  2(c).  If  the 
intervals of independence are too small, then the behavior will approach that
of a single-pass compiler, where, for example, the syntactic analyzer would
wait for a message from the lexical analyzer, quickly do its processing, send
results to the semantic analyzer (waiting if its queue is full), and finally
go back to waiting for a message from the lexical analyzer. Making the
intervals of independence larger provides the advantage that more processing
will be done within each interval relative to the amount of time spent waiting
and transmitting messages. However, making the intervals too large will result
in, say, the syntactic analyzer having to wait too long to get information
from the lexical analyzer before it can proceed. Clearly the optimal solution
depends on the characteristics of the network, the computers, the operating
system and the individual processes themselves.




A  simple  way  to  control  the  size  of  the  intervals  is  to  adjust  the 
amount  of  information  sent  in  each  message.  For  example,  the  lexical  analyzer 
sends  token  numbers  to  the  syntactic  analyzer  and  token  strings  to  the 
semantic  analyzer.  The  lexical  analyzer  saves  these  numbers  and  strings  in 
internal buffers. Only when it has filled one of the buffers does it send a
message (the contents of the full buffer) to one of the other processes. This
buffering  mechanism  enables  the  intervals  of  independence  to  be  increased  and 
the  number  of  individual  messages  passed  to  be  reduced. 
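
The buffering mechanism just described can be sketched as follows. The class
and method names are hypothetical, and the `send` callable stands in for
whatever message-sending primitive the system provides; the final `flush` at
end of input is an assumption, since the text does not say how a partly
filled buffer is handled.

```python
class BufferedSender:
    """Collect items locally and send them as one message when the buffer fills."""

    def __init__(self, capacity, send):
        self.capacity = capacity   # number of items per message
        self.send = send           # callable that transmits one message
        self.items = []

    def put(self, item):
        self.items.append(item)
        if len(self.items) == self.capacity:
            self.flush()

    def flush(self):
        """Send whatever is buffered (used when full and at end of input)."""
        if self.items:
            self.send(list(self.items))
            self.items.clear()


messages = []                                # stands in for the network
numbers = BufferedSender(3, messages.append)
for token_number in range(7):
    numbers.put(token_number)
numbers.flush()
print(messages)                              # [[0, 1, 2], [3, 4, 5], [6]]
```

Raising `capacity` lengthens the interval between sends and reduces the
number of messages, which is exactly the tuning knob discussed above.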


The  third  and  final  factor  to  be  considered  is  that  of  balancing  the 
flow  of  messages  between  the  processes.  That  is,  for  each  communication  link, 
we  want  the  number  of  messages  that  are  sent  to  be  approximately  equal.  As  an 
example,  again  consider  the  lexical  analyzer.  Observation  shows  it  sends  more 
than  twice  as  many  token  numbers  as  it  does  token  strings.  Suppose  it  has 
just filled its buffers and sent them out.

    N: token number buffer    n1 n2 n3 n4 n5 n6

    S: token string buffer    s1 s2 s3

The  syntactic  analyzer  will  receive  the  N  buffer,  process  it  and  eventually 
send  out  a  semantic  action  number  buffer. 


    A: action number buffer   a1 a2 a3 a4 a5 a6

Now,  if  the  token  numbers  that  correspond  to  the  token  strings  in  S  were  in  N, 
then  action  numbers  that  tell  what  to  do  with  the  strings  in  S  will  be  in  A. 
Thus  balancing  will  result  in  smooth  information  flow  where  waiting  times  will 
be  small. 


As an example of what can happen when the flows are not balanced, assume
the message queue size equals 2 (as is the case on our system), and that the N
buffer holds 8 elements and the S buffer holds 1. After about 7 tokens have
been read, the lexical analyzer will have tried to send 3 token strings.
However, since the N buffer is not yet full, no token numbers have been sent,
so no action numbers have reached the semantic analyzer and none of the token
strings has been received. The first two strings fill the message queue, and
the lexical analyzer is blocked indefinitely waiting to send the third
token string. Thus, as this extreme case illustrates, unbalanced message
flows can even result in deadlock.
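
The deadlock scenario can be reproduced with a deliberately simplified model.
The sketch below makes several assumptions that are not in the text: the
parser always keeps up, a queued string message can be consumed as soon as a
token-number message covering its last token has been sent, and the token
stream carries 3 strings in every 7 tokens. The function name and the stream
are hypothetical.

```python
from collections import deque


def first_deadlock(tokens, n_size, s_size, queue_cap=2):
    """Return the 1-based token index at which the lexer blocks forever, or None.

    tokens is a sequence of booleans: True if that token carries a string.
    """
    covered = 0          # highest token index already sent in an N message
    n_count = 0          # token numbers waiting in the N buffer
    s_buf = []           # token indices of strings waiting in the S buffer
    s_queue = deque()    # queued S messages, each tagged with its last token index
    for index, has_string in enumerate(tokens, start=1):
        n_count += 1
        if n_count == n_size:              # N buffer full: send it
            covered, n_count = index, 0
        if has_string:
            s_buf.append(index)
            if len(s_buf) == s_size:       # S buffer full: try to send it
                while s_queue and s_queue[0] <= covered:
                    s_queue.popleft()      # receiver can consume these messages
                if len(s_queue) == queue_cap:
                    return index           # send blocks; only the lexer could unblock it
                s_queue.append(s_buf[-1])
                s_buf = []
    return None


stream = [i % 7 in (2, 4, 0) for i in range(1, 71)]   # hypothetical token stream
print(first_deadlock(stream, n_size=8, s_size=1))     # 7
print(first_deadlock(stream, n_size=20, s_size=10))   # None
```

In this model the unbalanced sizes (8 and 1) block at the seventh token,
while sizes of 20 and 10 run through the whole stream.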

Having  already  taken  care  of  the  first  factor,  we  were  left  with  the 
task  of  choosing  the  buffer  sizes  to  optimize  the  second  and  third  factors. 
We  first  attacked  the  third  factor  by  setting  the  S  buffer  size  at  10  token 
strings,  and  testing  the  response  times  for  various  N  and  A  buffer  sizes.  For 
test  programs  of  600  lines  of  code,  the  responses  for  N,A  =  20  ±  2  were  1:02 
minutes  and  the  response  increased  slowly  as  N,A  moved  outside  this  range. 
Thus  the  optimal  ratio  of  buffer  sizes,  N:S:A,  was  about  2:1:2.  We  then 
attacked  the  second  factor  by  holding  this  ratio  fixed  and  varying  the 
magnitude  of  the  buffers.  For  test  programs  of  600  lines  of  code,  the 
response times and processor times are reported in Table 1.

Since  our  network  limited  messages  to  at  most  256  bytes,  we  were  unable 
to  test  larger  buffers  (20  token  strings  requires  240  bytes).  The  data  in  the 
table  clearly  show  that  increasing  the  size  of  the  buffers  when  the  buffers 
are small provides dramatic improvements. However, once the buffer sizes
(N,S,A) reach (12,6,12), the response curve becomes rather flat and remains so
through the rest of the range tested. We thus picked the minimum point of
this curve as the values to set the buffer sizes for the distributed compiler;
that is, we let

    N hold 20 token numbers (2 bytes/number),
    S hold 10 token strings (12 bytes/string),
    A hold 20 action numbers (2 bytes/number).
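
The message-size limit that bounded these tests is easy to check. The sketch
below uses only the figures given in the text (256-byte messages, 2 bytes per
token or action number, 12 bytes per token string); the names are
hypothetical.

```python
MAX_MESSAGE_BYTES = 256                   # network limit stated in the text
ITEM_BYTES = {"N": 2, "S": 12, "A": 2}    # bytes per buffered item


def message_bytes(buffer_name, item_count):
    """Size of one message carrying a full buffer of the given kind."""
    return ITEM_BYTES[buffer_name] * item_count


def fits(sizes):
    """True if every buffer in sizes (name -> item count) fits in one message."""
    return all(message_bytes(name, count) <= MAX_MESSAGE_BYTES
               for name, count in sizes.items())


print(message_bytes("S", 20))                  # 240
print(fits({"N": 40, "S": 20, "A": 40}))       # True
print(fits({"N": 44, "S": 22, "A": 44}))       # False
```

The string buffer is the binding constraint: at the 2:1:2 ratio, the next
step beyond (40,20,40) would need a 264-byte string message.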

G.2.4  Single-Pass  Version 

A single-pass version of the compiler was also constructed to be used as
a standard of comparison in evaluating the performance of the distributed
compiler. It uses exactly the same components as the distributed version.
However, instead of communicating by sending messages, the components
communicate using procedure invocation, with the syntactic analyzer acting as
the driver. When it needs a token it calls the lexical analyzer, and similarly,
when it has determined that a semantic action needs to be performed, it calls
the semantic analyzer. Thus the single-pass version is implemented as a
sequential process that runs on a single computer.

G.3 THE EXPERIMENT

The  point  of  this  case  study  was  to  test  the  feasibility  of  distributed 
compilation.  As  described  above,  distributed  and  single-pass  versions  of  the 
same  compiler  were  constructed,  differing  only  in  global  control  structures. 
Thus  the  only  factor  which  could  account  for  any  performance  differences  is 
the  method  of  communication  between  the  components  and  the  parallelism  it 
allows. The distributed version communicates by sending messages; the
single-pass version communicates by procedure invocation and parameter passing.

Therefore,  if  we  consider  the  total  amount  of  processing  time  consumed  by  the 
compilers  in  compiling  the  same  program,  it  seems  likely  that  the  distributed 
version  would  require  a  little  more  time,  as  message  passing  requires  more 
overhead  than  procedure  invocation.  However,  this  factor  will  be  unimportant 
if  significant  parallelism  can  be  achieved  in  the  distributed  version,  thereby 
substantially  reducing  total  response  time.  Thus  we  will  compare  the  response 
time of the distributed version to that of the single-pass version.

For this experiment three Prime computers in our local area network were
used: two Prime P550’s, systems A and B, and one Prime P400, system C. These
systems are compatible with respect to machine instructions and operating
system, but system C is a little slower. The components of the distributed
compiler were placed as follows: lexical analyzer on system A, syntactic
analyzer on system B, and the semantic analyzer on system C. The single-pass
compiler was run on system A.

Specifically, the following tests were performed. First, the two compilers
were tested under completely unloaded conditions; that is, the only
other load on the system was due to the operating system. These conditions
are of interest as they indicate the maximum possible benefit that can be
achieved by distributing a compiler. For this case, the two compilers were
run on Jigsaw test programs ranging in size from 25 to 1200 lines of code.
For each program the response time and processor (CPU) times used were
recorded. The results are shown in Table 2 and Figure 1. All times in the
tables are in units of seconds, except for the longer response times, which are
expressed as 'minutes:seconds'.

Secondly,  the  compilers  were  tested  under  moderately  loaded  conditions, 
where approximately five people were using each system. Although this does
not constitute a well-controlled experiment, it does give an indication of the
trend in the response times as the load factor is increased. The results for
this case are shown in Table 3.

G.4  INTERPRETATION 

The data reported in Table 2 and Figure 1 clearly indicate that
distributed compilers can achieve significant improvements in response time
over traditional single-pass compilers. Indeed, for programs of more than 100
lines of code the distributed compiler was 2 to 2.5 times faster than the
single-pass compiler. For example, for a program of 1200 lines, the
single-pass compiler took 4 1/2 minutes, while the distributed compiler took
only 2 minutes. This ratio obviously would have a considerable impact when
compiling
even  larger  programs.  For  programs  smaller  than  100  lines,  we  see  that  use  of 
the  distributed  version  is  still  advantageous,  although  less  overwhelmingly. 
This  loss  of  advantage  can  be  accounted  for  by  the  fact  that  the  distributed 
version  has  fixed  overhead  involved  in  setting  up  the  virtual  circuits  and 
filling  the  pipeline  (i.e.,  the  syntactic  analyzer  cannot  start  until  the 
lexical  analyzer  has  filled  a  buffer  with  token  numbers). 
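
The speed-up factors quoted above can be recomputed from the response times
listed for the unloaded system in Section G.6. The sketch below transcribes
those times into seconds; the 400-line single-pass entry is read as 1:31,
which is an assumption because that entry is damaged in the source. The names
are hypothetical.

```python
# (program size in lines, single-pass response, distributed response), seconds
RESPONSE_TIMES = [
    (25, 7, 5), (50, 13, 8), (100, 25, 13), (200, 47, 22),
    (300, 68, 32), (400, 91, 42), (500, 114, 52), (600, 137, 62),
    (700, 159, 72), (800, 179, 81), (900, 199, 90), (1000, 222, 100),
    (1100, 246, 110), (1200, 270, 119),
]


def speedups(rows):
    """Map program size to single-pass time divided by distributed time."""
    return {size: single / distributed for size, single, distributed in rows}


for size, factor in speedups(RESPONSE_TIMES).items():
    print(f"{size:5d} lines: speed-up {factor:.2f}")
```

The ratio rises from about 1.4 at 25 lines to a little under 2.3 at 1200
lines, which reflects the fixed start-up cost being spread over more work.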

The message buffer sizes, used to control the frequency of interactions
between components of the program, turned out to be a very important
performance factor. Without buffering, the best speed-up factor obtained was
about 1.4 (as opposed to 2.5 with the reported buffer sizes). As buffering
was introduced and the sizes increased, performance at first improved rather
dramatically. The sizes used reflect a leveling-off point in a graph of
performance versus buffer sizes.

Let  us  now  examine  the  relationship  between  the  performance  of  the 
individual  components  and  that  of  the  distributed  compiler.  These  components 
are  roughly  equal  in  the  processing  time  that  they  consume,  with  the  syntactic 
analyzer consuming the largest portion. Since the maximum potential speed-up
factor $f_s$ is limited by the slowest component, it is very important in
distributed programs to concentrate performance improvement efforts on such
components. Note that the effects of improving the slowest component are much
more dramatic with distributed programs. Speeding up the parser in our
compiler by 15% would probably improve the performance of the distributed
version by close to 15%, but it would improve the single-pass version by less
than half of that factor.
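
The 15% argument can be checked with a rough calculation. This sketch assumes
an idealized pipeline whose running time equals that of its slowest stage, and
it uses the component CPU times of the 600-line run (scanner 40, parser 52,
semantic 31 seconds) as stand-ins for the stage times in both versions. Both
assumptions are simplifications, and "15% faster" is read here as a 15%
reduction in parser time.

```python
def relative_gain(before, after):
    """Fraction of running time saved."""
    return 1 - after / before


scanner, parser, semantic = 40, 52, 31   # CPU seconds, 600-line test program
faster_parser = parser * 0.85            # parser time reduced by 15%

# Serial version: stage times add up.
serial_gain = relative_gain(scanner + parser + semantic,
                            scanner + faster_parser + semantic)

# Idealized pipeline: the slowest stage sets the pace.
pipeline_gain = relative_gain(max(scanner, parser, semantic),
                              max(scanner, faster_parser, semantic))

print(f"serial:   {serial_gain:.1%}")
print(f"pipeline: {pipeline_gain:.1%}")
```

Under these assumptions the pipeline gains the full 15% while the serial
version gains a little over 6%, consistent with the "less than half" estimate.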

To determine the amount of processing time needed to distribute the
compiler, we can compare the processing time used by the single-pass version
with the sum of the processing times of the distributed components. The
difference between the total CPU time column in Table 2 and the single-pass
processor time column seems to be made up of two components: a fixed overhead
of about 3 seconds, and a proportionate increase of about 5%. The fixed
overhead results from the
time  necessary  to  initially  set  up  the  virtual  circuits,  and  the  proportionate 
increase  is  caused  by  the  replacement  of  procedure  invocation  with  message 
passing.  Compared  with  the  positive  effects  of  parallelism,  these  negative 
effects  are  not  significant. 

Finally,  we  consider  the  data  from  the  tests  where  the  system  was 
moderately  loaded.  Again  the  distributed  version  was  faster,  but  the  speed-up 
factor  was  much  smaller,  about  1.5.  Thus  it  is  apparent  that  the  distributed 
compiler  was  more  adversely  affected  by  the  load  on  the  system  than  the 
single-pass  version.  This  result  is  expected,  since  the  speed  of  the 
distributed  version  depends  on  the  smooth  flow  of  information  between  the 
processes  and  loading  the  system  increases  the  competition  for  time  slices, 
thereby  increasing  the  probability  that  a  message  will  be  sent  to  a  process  in 
a wait state. Hence, loading the system has the potential to increase
effective message transmission time and thus slow down the individual
components. A possible remedy for this problem would be to have a sophisticated
distributed operating system to oversee the operation of the network. If such
an operating system had knowledge of the running characteristics and the
communication needs of the processes in the network, then it could possibly
schedule processes so as to enhance the smooth flow of information.

G.5 CONCLUSION

The  significance  of  this  study  is  not  merely  that  it  demonstrates 
potential  benefits  of  distributed  compilation,  but  rather  that  it  suggests 
that  at  least  some  class  of  programs  traditionally  executed  sequentially  can 
be  successfully  partitioned  as  distributed  programs.  We  believe  this  class 
includes  not  only  compilers,  but  any  program  which  operates  as  a  sequence  of 
transformations  on  its  input  to  produce  some  output.  Such  programs  map  nicely 
to  distributed  computing  systems  which  provide  a  pool  of  assignable,  general- 
purpose  processors.  In  such  a  system,  computers  could  be  allocated  to  the 
component  processes  of  the  compiler  (or  other  program)  for  the  entire  length 
of  the  compilation,  thus  achieving  the  ideal  conditions  of  the  unloaded  tests. 
It  should,  therefore,  be  possible  to  achieve  speed-up  factors  of  the  magnitude 
we  observed. 

It  should  be  noted  that  it  was  quite  easy  to  transform  a  traditionally- 
structured  compiler  into  a  distributed  one.  Using  message  passing  as  the 
means  for  communication  between  components  requires  only  thoughtful  design  of 
component  interfaces.  No  complex  synchronization  protocols  need  be  devised. 
The  message  passing  corresponds  to  simple  procedure  invocations  in  the 
traditional  program.  Again,  this  should  hold  for  a  broader  class  of  programs. 

We  currently  observe  the  development  of  systems  which  commonly  provide 
conditions  similar  to  those  of  the  unloaded  tests.  With  personal  computers 
and  small  business  systems  becoming  inexpensive,  networks  of  them  are 
proliferating. In such networks, the use of human resources rather than the
use of cheap processors is optimized, so processors are, on the average,
lightly loaded and thus are available for use by distributed programs. Another
reason  why  such  a  system  is  a  good  candidate  is  that  the  processors  are  not 
very  fast,  so  that  use  of  parallelism  is  particularly  desirable.  Furthermore, 
the  memory  capacity  on  these  systems  may  be  limited,  making  a  distributed 
program  advantageous,  since  its  component  processes  are  naturally  smaller  than 
the  entire  program  would  be  if  it  were  monolithic. 

Finally,  we  must  consider  the  system  dependencies  of  our  results.  The 
success of the distributed compiler depends on message delay times being
small. Its loss of advantage as a load appeared on the system is direct
evidence  of  this  dependency.  Thus  our  results  are  most  applicable  to  high- 
bandwidth  local  area  networks  which  can  provide  the  necessary  speed  of  message 
delivery.  The  introduction  of  the  concept  of  buffering  messages  within 
program  components  as  a  tuning  technique  makes  our  results  less  dependent  on 
the  more  detailed  system  characteristics.  With  proper  use  of  buffer  sizes,  it 
is  likely  that  our  results  could  be  matched  on  a  variety  of  distributed 
systems  connected  by  local-area  networks. 

G.6  TABLES  AND  FIGURES 

TABLE 1

BUFFER SIZE TEST RESULTS

  BUFFER SIZES     RESPONSE     SCANNER     PARSER      SEMANTIC
                   TIME         CPU TIME    CPU TIME    CPU TIME
   N    S    A     (min:sec)    (sec)       (sec)       (sec)

   1    1    1       2:23         57          82          47
   4    2    4       1:36         46          58          35
   8    4    8       1:13         42          55          32
  12    6   12       1:06         42          53          31
  16    8   16       1:03         41          53          31
  20   10   20       1:02         40          52          31
  24   12   24       1:03         40          52          31
  28   14   28       1:03         40          52          31
  32   16   32       1:04         39          51          31
  36   18   36       1:04         39          51          31
  40   20   40       1:04         39          51          31

TABLE 2

TIMING DATA FOR RUNS ON UNLOADED SYSTEM

           SINGLE PASS COMPILER     DISTRIBUTED COMPILER
PROGRAM    RESPONSE    PROCESSOR    RESPONSE    SCANNER     PARSER      SEMANTIC    TOTAL CPU
SIZE       TIME        CPU TIME     TIME        CPU TIME    CPU TIME    CPU TIME    TIME
(lines)    (min:sec)   (sec)        (min:sec)   (sec)       (sec)       (sec)       (sec)

   25         7           5            5           2           3           3           8
   50        13           9            8           3           5           5          13
  100        25          20           13           7          10           8          25
  200        47          39           22          14          19          12          45
  300      1:08          58           32          20          26          17          63
  400      1:31          78           42          27          35          22          84
  500      1:54          96           52          34          44          26         104
  600      2:17         115         1:02          40          52          31         123
  700      2:39         135         1:12          47          61          36         144
  800      2:59         154         1:21          53          70          40         163
  900      3:19         172         1:30          60          77          45         182
 1000      3:42         192         1:40          67          86          49         202
 1100      4:06         212         1:50          73          94          53         220
 1200      4:30         230         1:59          79         102          58         239

TABLE  3 

TIMING  DATA  FOR  RUNS  ON  LOADED  SYSTEM 

           SINGLE PASS COMPILER     DISTRIBUTED COMPILER
PROGRAM    RESPONSE    PROCESSOR    RESPONSE    SCANNER     PARSER      SEMANTIC    TOTAL CPU
SIZE       TIME        CPU TIME     TIME        CPU TIME    CPU TIME    CPU TIME    TIME
(lines)    (min:sec)   (sec)        (min:sec)   (sec)       (sec)       (sec)       (sec)

FIGURE 1

COMPILER STRUCTURE

    characters  -->  | lexical  |  -->  token numbers
                     | analyzer |  -->  token strings

            (A) LEXICAL ANALYZER TRANSFORMATION

    token numbers  -->  | syntax   |  -->  action numbers
                        | analyzer |

            (B) SYNTAX ANALYZER TRANSFORMATION

    action numbers  -->  | semantic |  -->  instructions
    token strings   -->  | analyzer |

            (C) SEMANTIC ANALYZER TRANSFORMATION

    source  | lexical  | tokens  | syntactic | actions  | semantic | target
    ----->  | analyzer | ----->  | analyzer  | ------>  | analyzer | ----->
                 |                                           ^
                 |               token strings               |
                 +-------------------------------------------+

            (D) OVERALL COMPILER STRUCTURE

FIGURE 2

TIMING DIAGRAMS

[Timing diagrams. Each panel plots the lexical, syntactic, and semantic
analyzers against time, with single-step intervals dt1, dt2, and dt3.]

            (A) SINGLE PASS COMPILER

            (B) MULTI-PASS COMPILER

            (C) IDEAL DISTRIBUTED COMPILER

Architecture for a Global Operating System


M.  S.  McKendry,  J.  E.  Allchin,  and  W.  C.  Thibault 


School  of  Information  and  Computer  Science 
Georgia  Institute  of  Technology 
Atlanta,  Georgia  30332 


ABSTRACT 


Global  operating  systems  are  suited  to  distributed, 
local-area  network  environments.  A  decentralized 
global  operating  system  can  manage  all  resources 
globally,  relying  on  functional  requirements  for 
resource  allocations,  rather  than  the  relative  physical 
locations  of  the  resource  allocation  mechanism  and 
the  resource  itself.  Among  the  advantages  of  global 
operating  systems  are  the  ability  to  use  idle  resources 
and  to  control  the  environment  as  a  single  cohesive 
entity.  This  paper  introduces  an  architectural 
approach  to  supporting  decentralized  global  operating 
systems.  The  approach  addresses  the  problem  of 
managing  distributed  data  by  incorporating 
specialized  data  management  facilities  in  the  kernel. 
This  data  management  is  especially  useful  to  the 
operating  system  itself.  A  capability-based  access 
scheme  provides  flexible  control  of  resources  and 
autonomy.  The  approach  is  being  utilized  in  the 
Clouds  operating  system  project  at  Georgia  Tech. 


1. Introduction

Increasingly, the computers within an organization
consist of a heterogeneous group of machines linked by
high speed (yet relatively inexpensive) local area
networks. Mainframes, office stations, scientific
workstations, personal machines, and even real-time
controllers may participate in this internetwork of
machines. In many such environments, it is desirable
that users view the entire decentralized resource pool
as a single computing resource. Users could then be
shielded from multiple user interfaces and relieved
from having to decide how best to accomplish objectives
using the available resources.

In effect, we want to hide the decentralization of the
resources from users, so they do not have to be
concerned with which particular resources are used to
accomplish their objectives. Furthermore, we need to
group resources for the purposes of autonomy and
protection, whether the resources in question are files,
machines, or logical services. For a completely general
facility, support for arbitrary (possibly intersecting)
groupings of resources is needed.

The concept of a global operating system embraces
these requirements [Jens82] [Lamp81]. In a global
operating system, all resources are managed and
allocated globally. The physical locality of a resource -
whether local or remote, for example - is not inherently
a part of the decision process. Decisions can be made
solely on the basis of cost factors and logical
constraints, rather than physical locality. For
example, assignment of a processor to a process might
be performed on the basis of code file location, expected
I/O, expected CPU utilization, and current processor
and network loadings.

Transparency appears to be a key quality in the
architecture of a decentralized global operating
system. This has two main aspects:

Resource Access:     Boundaries between machines should be
                     transparent during access to resources
                     if desired. This is provided by many
                     existing inter-process communication
                     mechanisms (e.g., [Rash81]).

Decision Apparatus:  The relative physical locations of a
                     policy apparatus and the resources it
                     controls should be transparent to the
                     policy apparatus. To do this efficiently
                     may also require that data describing
                     resources be accessible independently of
                     its physical location.

Given a system supporting transparency in its access to
resources and its decision apparatus, construction of
arbitrary groupings of resources is simplified, as is
allocation of resources on a global basis. These
qualities are not necessarily easy to achieve, however.
For example, for a decision apparatus to operate
independently of the resources it controls, it may need
the ability to "back-up" if a remote processor fails after
a decision is made. Furthermore, decentralization
requires that decisions be made on the basis of
heuristics or probabilities, using out-of-date or
inconsistent data [Jens81].

or Naval Research, N00014-79-C-873 and the USAF Rome Air


In  this  paper  we  introduce  a  structural  approach,  or 
architecture,  for  a  system  designed  to  support 
decentralized  global  operating  systems.  In  this 
approach,  which  is  still  being  refined  by  the  Clouds 
project  at  Georgia  Tech,  we  take  the  view  that  kernels 
provided  to  support  the  operating  system  on  each 
machine  should  provide  the  uniformity  and 
transparency  required.  Using  the  object  model  (data 
abstraction)  as  a  basis,  we  intend  to  provide  a 
sophisticated  database  management  system  within 
kernels, but leave the specific details of aspects such as
synchronization,  recovery,  and  atomicity  to  the 
designers  of  the  operating  system  (the  client  system) 
that  utilizes  the  kernels.  The  kernels  provide 
mechanisms  to  implement  these  requirements  without 
specifying  policy  of  how  the  mechanisms  should  be 
used. 


The  paper  discusses  the  goals  (Section  2),  requirements 
(Section  3),  and  architectural  concepts  (Section  4)  for 
this  approach.  The  approach  is  being  implemented  in 
the  Clouds  operating  system,  which  will  run  as  a 
native  operating  system  on  all  participating  machines. 
The  environment  assumed  is  a  group  of  machines 
connected  via  an  internetwork  of  high-speed 
(inexpensive) local area networks. For practical
reasons, Clouds is being implemented initially on
homogeneous  machines:  the  Three  Rivers  Perq 
[3RCC82].  The  Perq  is  a  scientific  workstation  of 
minicomputer  capacity.  We  are  using  10  Mb/sec 
Ethernet  technology  for  the  local  area  networks. 

2.  Goals 

The  environment  we  are  considering  has  been 
characterized  as  a  Fully  Distributed  Processing  System 
(FDPS).  According  to  Enslow  [Ensl78],  an  FDPS 
exhibits  the  characteristics  of  a  multiplicity  of 
resources,  physical  distribution,  unity  of  control, 
network  (location)  transparency,  and  component 
autonomy.  The  primary  goal  of  the  architecture  is  to 
support  a  reliable,  unified  computing  environment  so 
that  these  characteristics  can  be  fully  realized.  In  this 
sense,  the  architecture  could  form  the  basis  of  a 
distributed  timesharing  system.  While  such 
constraints as "one user per workstation" might hold at
various  times  (making  some  decisions  trivial),  the 
system  can  take  responsibility  for  all  selection  and 
assignment  of  resources  to  users.  Note  that  an 
architecture  can  form  a  basis  for  many  different 
systems,  not  just  the  traditional  "general  purpose" 
systems.  Systems  supported  might  include  distributed 
process  control,  for  example. 


Two  secondary  goals  are  apparent.  Firstly,  as  in 
conventional  operating  systems,  the  architecture 
should  facilitate  high  resource  utilization  within 
performance  constraints  such  as  response  time  or  total 
cost. Secondly, it should help users to access or create
services that are common to conventional systems and
services  that  are  peculiar  to  distributed  systems.  Many 
of  the  requirements  of  application  programs  for  these 
services  are  shared  by  the  operating  system  itself, 
which  attempts  to  provide  reliable  service  despite  the 
possibility  of  machine  and  network  failures. 


Our  final  goal  is  to  provide  tunable  autonomy-- 
dynamically  configurable  domains  of  resource  control. 
A  tunably  autonomous  system  could  provide  a  variety 
of  resource  allocation  schemes,  varying  from  highly 
autonomous  systems,  to  the  equivalent  of  tightly- 
coupled  multiprocessing,  where  decisions  affecting  one 
machine  can  be  made  by  any  other  machine.  Tunable 
autonomy  facilitates  construction  of  logical  resource 
groupings at multiple levels. For example, office,
department,  division,  company,  and  intercompany 
levels  might  be  established,  with  differing  autonomy 
and  sharing  constraints  at  each  level. 


3.  Requirements 


Two issues, data management and resource
management,  stand  out  as  fundamental  to  the  Clouds 
architecture.  Only  the  mechanism  for  resource 
management  has  to  be  provided  by  the  architecture, 
but  requirements  for  effective  and  efficient  resource 
policy  must  be  given  consideration. 

3.1  Data  Management 

Data  management  is  a  ubiquitous  problem  in  computer 
programs.  The  problem  is  particularly  severe  in 
distributed  systems.  Conventional  operating  systems 
contain  a  plethora  of  structures  representing  system 
state.  A  global  operating  system  must  do  the  same, 
and  must  also  deal  with  additional  issues  including 
increased  concurrency,  partial  configurations,  and 
failures and associated recovery. Each node must be
able  to  access  both  local  and  remote  data.  Considerable 
research  has  been  expended  studying  general  problems 
encountered  in  managing  distributed  data,  but  little 
attention  has  been  paid  to  problems  peculiar  to 
decentralized  real-time  systems.  As  a  consequence, 
special  solutions  tend  to  be  reinvented  for  each 
application  system  (e.g.,  [Birr81])  and  for  each  part  of 
the  operating  system. 


Conventional  database  research  usually  assumes  an 
application  environment  in  which  data  consistency 
(defined  through  serializability)  is  of  prime 
importance.  However,  serializability  is  applicable 
mainly  when  independent  activities  compete  for 
resources--it  is  not  suited  to  cooperation  between 
processes,  such  as  is  achieved  through  message-based 
interprocess  communication.  Thus  it  is  necessary  to 
deal  with  an  orthogonal  structure  of  atomic  actions 
which can be the units of recovery and concurrency.
Further,  one  of  the  attractions  of  serializability  is  its 
simplicity  in  the  absence  of  semantic  information 
concerning  data  accesses.  This  simplicity  is  obtained 
at the cost of concurrency though, and in an operating
system considerable semantic information is available
concerning  both  the  accesses  and  the  operations  on  the 
data  stored.  This  information  can  be  exploited  to 
improve  concurrency,  and  thus  availability  and 
performance. 


Due  to  the  scope  of  distributed  data  management 
requirements,  data  management  takes  a  prominent 
place  in  our  architectural  approach.  Distributed  data 
management  can  be  made  quite  sophisticated, 
supporting  failure  atomicity,  consistency  conditions 
including  (but  not  limited  to)  serializability,  creation 
and  location  of  data  objects,  synchronization  of  access 
to  data,  replication  of  data,  and  invocation  of 
operations  on  data.  It  can  also  provide  a  basis  for 
system synchronization [Allc83].

3.2  Resource  Allocation 

Consideration  of  resource  allocation  requirements  is 
critical  to  the  success  of  a  global  operating  system 
architecture.  Ideally,  an  operating  system  should  take 
complete  responsibility  for  the  allocation  of  resources 
to  a  user,  if  the  user  so  desires.  The  architecture  must 
provide  facilities  to  support  this  allocation  control. 


Consider  the  example  depicted  in  Figure  1.  A  user 
directly  connected  to  machine  A,  wishes  to  run  a 
program  p,  which  requires  serial  access  to  file  b 
currently  on  machine  B,  and  random  access  to  file  c  on 
machine  C.  The  operating  system  must  determine 
which  machine  should  provide  the  computational 
resources  necessary  to  run  p,  and  whether  any  files 
should  be  relocated  beforehand.  Factors  involved 
include user-specified constraints (e.g., 'fast' or
'cheap'), optimization of particular resources (e.g.,
device  channels,  processor  time,  network  bandwidth), 
current  loads,  and  interactions  with  other  programs. 
The  program  p  may  run  on  machine  A,  but  unlike 
many  conventional  distributed  operating  systems  (e.g., 
[Rede80]), this is determined heuristically at the time
of request; it is not a default. Thus, we are advocating
a more "intelligent" approach to resource allocation.

3.2.1  Tunable  Autonomy 

The  term  tunable  autonomy  characterizes  the  ability 
to  construct  arbitrary  logical  groupings  of  resources  for 
the  purposes  of  management.  A  single  group  of 
resources,  or  resource  pool  might  contain  one  or 
many  resources,  may  intersect  or  contain  other 
resource  pools,  and  can  extend  across  arbitrary 
machines.  Once  a  resource  pool  is  defined,  decisions 
concerning  the  resources  within  it  can  be  made  by  any 
processor  (or  program)  granted  control  over  that  pool, 
regardless  of  physical  location.  Within  this 
framework,  allowing  each  physical  machine  to  be 
autonomous  is  a  constraining  case,  but  not  the  only 
possible  case. 
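
The notion of a resource pool can be made concrete with a small sketch. The following Python fragment is illustrative only and is not taken from the Clouds design: the class name ResourcePool, its methods, and the representation of a resource as a (machine, resource) pair are all assumptions. It shows pools that contain other pools, span several machines, and accept decisions from any controller that has been granted control, regardless of location.

```python
class ResourcePool:
    """Hypothetical logical grouping of resources, independent of machines."""

    def __init__(self, name, resources=(), subpools=()):
        self.name = name
        self._resources = set(resources)   # (machine, resource) pairs
        self._subpools = list(subpools)    # pools contained in this pool
        self._controllers = set()          # parties granted control

    def resources(self):
        """Return every resource in this pool, including contained pools."""
        found = set(self._resources)
        for pool in self._subpools:
            found |= pool.resources()
        return found

    def machines(self):
        """Return the set of machines this pool extends across."""
        return {machine for machine, _ in self.resources()}

    def grant_control(self, controller):
        """Allow a processor or program to make decisions for this pool."""
        self._controllers.add(controller)

    def may_decide(self, controller, resource):
        """A decision is allowed only for a granted controller and a member resource."""
        return controller in self._controllers and resource in self.resources()


# An office-level pool nested inside a department-level pool.
office = ResourcePool("office", [("A", "cpu"), ("B", "disk")])
department = ResourcePool("department", [("C", "cpu")], [office])
department.grant_control("scheduler on C")
```

In this sketch, per-machine autonomy corresponds to one pool per machine controlled only by that machine; wider pools relax that constraint.
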

3.2.2 The Decision Process

While  a  decentralized  global  operating  system  should 
have considerable freedom to assign resources, its
ability to assign them effectively will be limited by
out-of-date and incomplete information. Machine states
change rapidly, so a perfectly consistent description of
the state of an entire system cannot be achieved
without paying a high performance penalty. As a
result,  reaching  a  decision  involves  a  more  heuristic  or 
probabilistic  approach  than  in  a  conventional,  single 
processor  system.  More  historic  data  to  assist 
prediction  can  help,  though;  a  system  could  keep 
extensive statistics of past activity. For example, if file
c in Figure 1 is known to be small, and program p
typically  makes  extensive  random  access  to  c,  the 
operating  system  might  decide  to  either  relocate  c  to 
machine  A,  or  to  run  p  on  machine  C. 


A  second  concern  of  the  decision  process  pertains  to 
decisions  that  have  been  made,  but  cannot  be 
implemented.  To  a  limited  extent,  this  can  be  avoided 
by  preclaiming  resources,  but  the  problems  of  failures 
cannot  be  avoided.  If  a  machine  fails,  it  may  be 
impossible  to  implement  a  resource  policy  decision.  In 
this  circumstance,  the  policy  apparatus  must  try  again, 
basing  the  next  attempted  decision  on  more  recent 
information. 

4.  Architectural  Directions 

In  many  operating  systems  for  distributed 
environments,  a  kernel  provides  primitive  operations 
for  inter-process  communication  (IPC)  via  messages 
(e.g., [Rash81]). We intend to take a radically different
approach,  however,  adopting  the  dual  structure 
[Laue79].  The  architecture  supports  processes, 
(passive)  objects,  and  invocation  of  operations  on 
objects.  Actions  (groupings  of  invocations)  provide  a 
basis for reliability. The primitive facilities provided
by the architecture use a variety of remote procedure
call  semantics  which  vary  in  dimensions  such  as 
reliability  and  asynchrony  [Spec82].  A  section  of  the 
kernel,  called  the  object  management  system  (OMS), 
implements  calls  on  objects.  Access  from  processes  to 
objects  is  via  capabilities,  which  are  protected  system 
names  managed  by  the  kernel.  Capabilities  can  be 
passed  as  parameters  when  operations  are  invoked; 
they  can  also  be  returned  in  a  fashion  similar  to 
function  values. 
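
As a rough illustration of capability-based invocation, the sketch below models a kernel that maps protected names to objects and checks each invocation against the operations the capability permits. It is an assumption-laden toy, not the Clouds OMS interface: the names Kernel, create_capability, invoke, and Counter are invented for the example, and random tokens stand in for kernel-protected names.

```python
import secrets


class Counter:
    """A passive object reachable only through its operations."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def read(self):
        return self.value


class Kernel:
    """Hypothetical kernel table: capability -> (object, permitted operations)."""

    def __init__(self):
        self._table = {}

    def create_capability(self, obj, operations):
        # The token is an unguessable name; clients never hold the object itself.
        token = secrets.token_hex(8)
        self._table[token] = (obj, frozenset(operations))
        return token

    def invoke(self, capability, operation, *args):
        # Possession of a valid capability is the only permission check.
        if capability not in self._table:
            raise PermissionError("unknown capability")
        obj, allowed = self._table[capability]
        if operation not in allowed:
            raise PermissionError("operation not permitted by this capability")
        return getattr(obj, operation)(*args)


kernel = Kernel()
counter = Counter()
full_cap = kernel.create_capability(counter, {"increment", "read"})
read_cap = kernel.create_capability(counter, {"read"})
kernel.invoke(full_cap, "increment")
```

Because a capability is an ordinary value in this sketch, it can be passed as an argument to another invocation or returned as a result, which mirrors the parameter and return-value passing described above.
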


The  kernel  will  provide  the  lower  levels  of  a  functional 
hierarchy.  At  the  lowest  level  is  the  hardware,  which 
we  consider  to  include  access  to  a  local  area  network. 
At the level above the hardware is the primitive
inter-machine communication used by the individual kernels
to  communicate  with  one  another  and  to  maintain  the 
object  management  system.  Data  and  process 
management  mechanisms  then  complete  the  kernel 
and  the  architecture.  Thus,  the  combination  of  services 
provided  by  individual  kernels  implements  the 
architecture  for  the  complete  system.  Above  the 
kernels  are  client  levels  to  provide  policy  for  the 
architecture.  Finally,  user  processes  implement 
applications.  A  pictorial  representation  of  the 
architecture  is  shown  in  Figure  2. 


Kernels  run  processes  and  maintain  objects.  However, 
because this is at the instigation of higher level
software, conventional message-based interprocess
communication is considered to be part of the client
system, as is the resource allocation policy apparatus.
These characteristics are, in fact, highly desirable
features, because they allow client-specifiable
interprocess communication, and permit a high degree of
flexibility  in  resource  policy.  A  method  of  using  objects 
to  implement  interprocess  communication  via 
messages  is  shown  in  Figure  3. 


4.1  Data  Management 

The  object  management  system  consists  of  two  primary 
components:  objects  and  actions;  both  are  user 
definable [Allc82] [Allc83]. Objects are passive entities
(modified  abstract  data  types)  which  are  accessible 
only  through  interface  procedures  that  define 
operations  on  the  objects.  Actions  are  ordered 
collections  of  operations  on  objects  which  require 
certain  properties  (e.g.,  failure  atomicity)  to  hold 
throughout  the  life  of  the  action.  Both  object  recovery 
and  synchronization  between  actions  are  controllable 
by the object itself (i.e., programmed within the object).
Thus, for example, weaker forms of consistency are
allowed,  depending  on  the  semantics  of  an  object. 
Actions  can  be  carried  out  by  single  processes  or 
cooperating  processes. 
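
The relationship between objects and actions can be sketched as follows. This is a minimal illustration under stated assumptions, not the extended Pascal facilities of Clouds: the class names Account and Action and the checkpoint/restore methods are hypothetical. The point it shows is that recovery is programmed within the object, while the action only records which objects it touched so that an abort can undo every operation together (failure atomicity).

```python
class Account:
    """Passive object whose recovery behavior is defined by the object itself."""

    def __init__(self, balance):
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

    # Object-defined recovery: what to save and how to roll back.
    def checkpoint(self):
        return self.balance

    def restore(self, saved):
        self.balance = saved


class Action:
    """Ordered collection of operations that commit or abort as a unit."""

    def __init__(self):
        self._saved = {}   # id(object) -> (object, state before first touch)

    def invoke(self, obj, operation, *args):
        if id(obj) not in self._saved:
            self._saved[id(obj)] = (obj, obj.checkpoint())
        return getattr(obj, operation)(*args)

    def commit(self):
        self._saved.clear()

    def abort(self):
        for obj, saved in self._saved.values():
            obj.restore(saved)
        self._saved.clear()


source, target = Account(100), Account(0)
transfer = Action()
try:
    transfer.invoke(target, "deposit", 150)
    transfer.invoke(source, "withdraw", 150)   # fails: insufficient funds
    transfer.commit()
except ValueError:
    transfer.abort()                           # undoes the deposit as well
```

An object with weaker consistency needs could supply a cheaper checkpoint or none at all; the sketch omits synchronization between concurrent actions entirely.
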


An  extended  Pascal  language  allows  object  classes  to 
be  defined,  and  the  object  management  system 
supports  objects  at  runtime.  Once  created  via  object 
classes,  object  instances  are  controlled  through 
requests  to  the  kernel  using  OMS  primitives,  such  as 
create  object,  destroy  object,  create  action,  destroy 
action,  commit  action,  and  invoke  operation.  Object 
classes  are  exported,  so  object  variables  can  be  typed 
automatically  in  a  manner  similar  to  Pascal  pointers. 
Thus,  the  object  management  system  can  be  viewed  as 
a  globally  distributed  heap  containing  long-lived 
objects.  For  transparency,  all  actions  communicate 
only  with  the  OMS  at  the  node  where  the  process 
implementing  the  action  resides  (not  shown  in  the 
conceptual  view  of  Figure  2).  The  OMSs,  in  turn, 
communicate  using  specialized  protocols  for  inter- 
machine  communication.  They  cannot  use  the  general 
IPC facility, because they form its basis.


A  primary  goal  of  the  OMS  is  support  for  network 
transparency,  wherever  desired.  This  transparency  is 
provided  by  making  operation  invocation  uniform, 
regardless  of  whether  the  subject  of  the  operation  is  on 
the  same  machine  as  the  client  invoking  the  operation. 
The  operating  system  is  thus  free  to  distribute 
processes  and  objects  without  their  knowledge  (unless 
specifically  directed  not  to  do  so).  Thus,  languages  for 
distributed  computing,  such  as  PRONET  [Macc82]  or 
Argus [Lisk82] can be well supported.

4.2  Resource  Management 

Because  this  paper  is  concerned  primarily  with 
describing  an  architectural  approach,  we  will  not 
discuss  resource  management  in  detail.  The  basic 
approach  is  to  use  the  capability-passing  mechanisms 
provided by the object management system to construct
capability managers [Kieb78]. Due to the
transparency  implemented  by  the  architecture, 
capability  managers  function  independently  of  their 
location  in  the  system.  Since  all  invocations  of 
operations  are  via  capabilities  and  possession  of  a 
capability  is  taken  as  permission  to  invoke  an 
operation,  there  is  no  structural  association  of 
particular  machines  with  particular  decisions.  Any 
object  that  possesses  a  capability  to  implement  policy 
decisions  can  implement  those  decisions.  For  example, 
each machine might contain a process-management
object  whose  function  is  to  instantiate  and  destroy 
processes  on  that  machine.  Any  object  that  possesses 
capabilities  for  the  operations  of  this  process- 
management  object  can  then  create  and  destroy 
processes on that particular machine. The
capability-based access scheme makes it unnecessary to have a
"special state" for resource managers--any object can
become  a  resource  manager,  thus  making  it  possible  to 
construct  arbitrary  pools  of  resources  independently  of 
machine boundaries. Of course, a choice of appropriate
capability-passing  primitives  is  critical  to  the  success 
of  this  approach  [Snyd81]. 


5.  Summary 


This  paper  has  introduced  an  architectural  approach 
for  decentralized  global  operating  systems  in  an 
environment  of  machines  connected  via  an  internet  of 
high-speed  local  area  networks.  A  global  operating 
system  manages  all  resources  globally,  without 
making  distinctions  between  local  and  remote 
resources.  One  characteristic  desirable  of  such  systems 
is  tunable  autonomy:  the  ability  to  construct 
arbitrary  logical  groupings  of  resources  for  the  purpose 
of  management.  Such  groupings  are  independent  of 
machine  boundaries. 


A  major  motivating  factor  in  the  design  of  the 
architecture  is  the  need  for  reliable  data  management 
in  the  low  levels  of  the  system.  This  can  be  achieved 
efficiently  by  making  some  constraints,  such  as 
serializability,  optional  according  to  particular  needs. 
Requirements  for  global  resource  management  also 
motivate  the  operating  system  architecture. 


The  architecture  described  provides  processes  and 
objects;  invocation  of  operations  on  objects  is  performed 
through  capabilities.  Objects  are  maintained  by  the 
object  management  system,  a  component  of  the  kernel. 
Despite  its  integration  into  a  low  level  of  the  operating 
system,  the  object  management  system  is  quite 
sophisticated,  providing  an  action  environment  in 
which  actions  may  initiate,  commit,  and  abort,  with 
appropriate  effect  on  object  states.  To  assist 
performance  and  reliability,  the  object  management 
system  supports  variable  recovery,  synchronization, 
replication,  and  consistency  conditions  including  (but 
not  limited  to)  serializability. 


From the base architecture, message-based
interprocess communication and a resource policy apparatus
can  be  constructed.  The  capability-based  invocation  of 
operations  on  objects  makes  it  possible  to  construct 
resource  managers  as  capability  managers.  Arbitrary 
objects  can  manage  capabilities,  depending  on  the 
capability-passing  primitives  of  the  object 
management system to provide the necessary access
control.  Thus,  resource  pools  can  be  constructed 
dynamically,  and  can  exist  independently  of  machine 
boundaries.


ALGORITHMS FOR MAINTAINING REPLICATED DATA USING WEAK CORRECTNESS CONDITIONS


James E. Allchin


Abstract 


A  suite  of  decentralized  algorithms  for  maintaining  distributed  replicated  data  is  presented.  The 
algorithms  do  not  necessarily  achieve  serial  consistency,  but  they  are  adequate  for  many  simple  data 
storage  problems  in  operating  systems  and  realtime  systems.  Applications  which  appear  well-suited 
to the suite include mail systems, naming servers, appointment calendars, certain types of file
dictionaries, operating system load tables (e.g., routing), and device state in distributed process
control systems. The algorithms are robust and are intuitively easy to understand. The algorithms
assume an unreliable network and tolerate node failures, network partitions, and lost, duplicate, and
out-of-order messages. Both goals for replicating data--high availability and rapid response time--are
met  by  the  algorithms.  The  basic  algorithms  use  resolution  tables  to  state  the  outcome  of  conflicts 
between  concurrent  actions.  Each  algorithm  is  oriented  toward  different  application  requirements 
and  provides  a  different  degree  of  message  traffic  overhead  and  availability.  The  efficiency  of  the 
algorithms  depends  on  the  acceptability  of  weak  correctness  conditions  in  the  applications.  The 
desired correctness condition is formally stated and the basic algorithm in the suite is proven correct.


1. Introduction


The  correctness  condition  usually  applied  to  data  storage  systems  states  that  the  result  of  any  set  of 
transactions  executed  should  be  the  same  as  some  serial  execution  of  that  set  of  transactions.  This 
serializability-based correctness condition assumes only that transactions execute correctly if run
serially. Distributed systems containing replicated data require relatively complicated algorithms to
achieve serial consistency and still obtain acceptable performance. These algorithms restrict
concurrency in order to achieve consistency. In certain applications, however, serializability as a
correctness condition is not required because the results for some class of non-serializable executions
of transaction steps are correct. See, for example, [Lamp76], [Kung80], and [Garc80]. In addition,
there are applications where even though the transactions desire to see a serial view, they will accept
some class of non-serial views and consider these views correct as well. These applications will accept
non-serial views in return for certain advantages not possible if strict serial consistency is enforced.
Performance [Jens82, McKe83], availability [Fisc82], and simplicity [Oppe81] have been cited as
encouragements  to  weaken  correctness.  Thus,  there  is  an  interesting  class  of  application  areas  for 
which  trading  serial  consistency  for  high  availability,  increased  performance  or  algorithm  simplicity 
is  warranted. 

There are many approaches for supporting copies of replicated data [Bern81]. Most of these maintain
serial  consistency.  However,  maintaining  serial  consistency  across  network  partitions  (caused  by 
assumed  failure  of  the  communication  system)  defeats  data  availability,  since  at  least  one  copy 
cannot be used and in the worst case, only one copy can be used. If weak consistency can be tolerated,
then it is possible to overcome this problem. However, resynchronization of the data copies must still
be  addressed  during  node  restart  or  network  merge  following  a  partition.  Contending  with  these 
issues  in  an  unreliable  environment  complicates  the  solutions  still  more.  Algorithms  which  handle 
all  of  these  issues  tend  to  be  complex,  using  a  variety  of  expensive  handshaking  protocols. 
Establishing that these algorithms are correct, under all the possible failure conditions, is generally
quite  difficult. 

In  this  paper,  we  present  a  suite  of  decentralized  algorithms  to  maintain  distributed  replicated  data 
with weak consistency. Algorithms from the suite can be customized to balance particular tradeoffs
required in different application systems. The algorithms assume an unreliable network and tolerate
lost, duplicate, and out-of-order messages, node failures, and network partitions. Both goals for
replication--high availability and rapid response time--are met by the algorithms. The basic
structure  of  the  algorithms  depends  on  resolution  tables  to  state  the  outcome  of  conflicts  between 
concurrent  actions.  Each  algorithm  is  oriented  toward  different  application  requirements  and 
provides  a  different  degree  of  message  traffic  overhead  and  data  availability.  The  efficiency  of  the 
algorithms depends on the acceptability of weak correctness conditions in application systems.

Work which is similar to ours is discussed in [Fisc82], [John75], [Oppe81], and [McQu78]. Most of our
problem formulation is based on Fischer and Michael [Fisc82]. Unlike their approach, though, we do
not  require  each  node  to  transmit  the  entire  node's  view  of  the  database  whenever  communication 
occurs:  only  the  changes  to  the  view  are  sent,  and  regardless  of  the  number  of  deletes  and  duration  of 
failure,  no  node  need  maintain  an  unbounded  list  of  changes  relative  to  the  database  size  (assuming 
the database is itself bounded). We believe that in particular cases (e.g., small databases)
passing the entire database is appropriate, while in many other applications this requirement is not
acceptable.

Our  work  is  particularly  interesting  because  we  use  resolution  tables  which  allow  easy  visualization 
of  the  conflict  resolution  strategy  and  we  provide  a  formal  proof  of  one  algorithm  from  the  suite  (with 
other proofs following in a straightforward manner from the framework presented). Further, it is our
belief that the suite of algorithms addresses a wide range of important problems in a clean and efficient
manner. Allchin [Allc83] contains additional information on the desirability of supporting both
serializable  and  non-serializable  synchronization  facilities  in  decentralized  systems  (in  particular 
operating  systems).  Specific  programmer-oriented  tools  for  controlling  atomic  action  synchronization 
and  recovery  are  also  presented. 


2.  Environment  and  Application  Domains 


The general environment assumed by the algorithms involves some collection of nodes arbitrarily
connected via some communication system. The communication network and nodes are considered
unreliable; that is, both may fail either partially or totally. Messages, if delivered, however, must
arrive ungarbled. That is, message corruption must be detectable. In addition, it is assumed that the
nodes and the network do not manufacture messages which violate the protocol of the algorithm.


Each  node  contains  a  view  of  the  entire  database  which  may  or  may  not  be  current  depending  on  the 
state of communication activity. It is later proven that, given sufficient reliability and assuming
changes to the data cease, all views will converge to contain the same data. That is, the views
are mutually consistent [Thom79].

Each  view  consists  of  a  set  of  elements,  item  names  and  associated  values.  Clients  manipulate  views 
by  specifically  referencing  (via  names)  particular  elements  in  the  views.  There  are  four  operations 
which manipulate a node's view. These eventually alter the other remote views (if the changes are
not superseded before the other nodes learn of the first change). The four basic operations are:


Insert (x,y)         adds an element with name x and value y

Update (x,y)         replaces the value of the element with name x with the value y

Delete (x)           removes the element with name x

List (set of names)  returns an ordered pair of element names and values for all elements
                     requested which exist in the local view at the time of the operation


Fischer and Michael [Fisc82] referred to a very similar environment and operation structure as a
distributed dictionary problem. In fact, the main difference is that we include an Update operation.
This is an important change, not simply a trivial extension. This is true because we also require
basically  the  same  two  restrictions: 

R1. Neither Update(x,y) nor Delete(x) can be performed at node i unless the element x is
in the local view at node i.

R2. All item names used in Insert operations must be unique.

The second restriction explains why the inclusion of the Update operation alters the problem's
structure. An Update is thus a primitive operation which cannot be formed from Insert and Delete
operations (a Delete followed by an Insert of the same name would violate R2). These restrictions
are required by the algorithms and are quite reasonable in the
application domains discussed below. R1 is quite intuitive since operations by definition must name
elements from the local view. R2 provides the assurance that once an item name has been deleted, it
cannot be reinserted. This avoids a conflict which would require some additional means to order the
Insert and Delete operations (conflict resolution). Throughout the remainder of this paper a change
refers to either an Insert, Update, or Delete operation.

There are two additional operations: Transmit and Receive. Transmit is used by a node to send
information concerning its view to other nodes. Receive is automatically invoked when a message
containing information from some other node is received. No information flows between views
unless Transmit operations are issued by a client. The frequency of Transmit operations dictates how
current a particular view is for some node. It is presumed that clients will issue Transmit operations
often enough so that views will converge acceptably often for the applications.

All six operations must be noninterfering when manipulating the local view. Regardless of the
method used, atomicity with respect to concurrent activity among the operations is assumed.
Because  the  maintenance  of  the  view  should  be  relatively  inexpensive  for  many  applications,  mutual 
exclusion  may  suffice. 
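
A minimal sketch of the local view and its four client operations may help fix ideas. It is written in Python purely for illustration and makes several assumptions that are not part of the algorithms: the class and method names are invented, a single lock provides the mutual exclusion mentioned above, and R2 is checked only against names inserted at this node by remembering them in an unbounded set (global uniqueness, for example by prefixing names with a node identifier, is left to the client). Transmit and Receive are omitted.

```python
import threading


class View:
    """Hypothetical local view: item names mapped to values."""

    def __init__(self):
        self._elements = {}
        self._used_names = set()        # names ever inserted here (for R2)
        self._lock = threading.Lock()   # mutual exclusion among operations

    def insert(self, name, value):
        with self._lock:
            if name in self._used_names:
                raise ValueError("R2: item names used in Insert must be unique")
            self._used_names.add(name)
            self._elements[name] = value

    def update(self, name, value):
        with self._lock:
            if name not in self._elements:
                raise KeyError("R1: element is not in the local view")
            self._elements[name] = value

    def delete(self, name):
        with self._lock:
            if name not in self._elements:
                raise KeyError("R1: element is not in the local view")
            del self._elements[name]

    def list_elements(self, names):
        """Return (name, value) pairs for the requested names that exist locally."""
        with self._lock:
            return [(n, self._elements[n]) for n in names if n in self._elements]


view = View()
view.insert("printer", "idle")
view.update("printer", "busy")
listing = view.list_elements(["printer", "plotter"])
```

Note that a deleted name stays in the remembered set, so an attempt to emulate Update by Delete followed by Insert is rejected, which is why Update has to be a primitive operation.
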

The operations and associated restrictions presented above appear to be sufficient for several
application areas. Applications which include problems related to maintaining some form of
replicated dictionary mesh well with our approach. For example, some distributed applications which
include this type of problem are mail systems, naming servers, file directories, appointment calendars,
and operating system load data maintenance (e.g., routing tables). In addition, applications like
process control systems which alter data values rapidly, but do not require serial consistency, can also
be supported. These applications tend to be self-correcting in nature and do not necessarily require
serial consistency.


3.  General  Suite  Structure 


We divide databases into two types: independent and dependent. Independent databases permit
elements to be changed by any node in the network. Thus once a data item name has been created, it
can be manipulated by any node in the network. Dependent databases permit elements to be changed
only by the node which created the data item name. That is, changes depend on which node was the
item's creator.

We also consider two levels of fault-tolerance: propagation and no propagation. Propagation implies
every node must guarantee all other nodes receive a change, even when a node is not directly
responsible for the change. Thus, even if the node which makes some change fails (or outward
communication from the node fails) the change can still propagate through the network, depending
on the state of the other nodes and remaining communication system. This approach implies that
data availability is more significant than message traffic overhead and local storage space. The
information  regarding  the  change  is  stored  at  each  node  until  that  node  is  sure  every  other  node  has 
seen  the  result  of  that  change.  No  propagation  implies  that  only  the  node  responsible  for  performing 
a  change  must  ensure  that  every  other  node  has  received  the  change. 

The  message  distribution  procedure  is  not  specified  in  the  suite,  since  the  network  topology  and 
availability  requirements  dictate  how  messages  are  actually  distributed  throughout  the  network. 
Messages could be broadcast, multicast, or simply sent to some next node.

The structure of the suite consists of a base algorithm and resolution tables to specify algorithm
actions when changes occur locally or are received from a remote node. There is a different resolution
table for each combination of database type and fault-tolerance discussed above. The base algorithm
need not be changed. In the following we present an overview of the basic data structures
and base algorithm. Then in Section 3.2 we discuss certain aspects of the algorithm in detail.
Finally, in Section 3.3 the base algorithm is presented together with the first resolution table.


3.1  Algorithm  Overview 

The basic algorithm is assumed to be replicated at all nodes. Three basic data structures are used in
the algorithm; each node has a separate set of these variables:

    V_i,   the database view for node i.

    SS_i,  a list at node i of changes which may not have been seen by the other
           nodes. (SS represents a synchronization set.)

    t_i,   a timestamp array which details how current node i's knowledge is of
           every other node.

Both  SS  and  t  are  transmitted  to  some  collection  of  remote  nodes  upon  a  client's  Transmit  request.  V 
is  not  sent  between  nodes  except  during  a  cold  start  of  a  node;  see  Section  5.6. 

When a change occurs at some node i, the change is reflected in the synchronization set SS_i and the
database view V_i. The change is marked with the current value of a node-relative Clock.
Synchronization sets contain at most one entry for every item name changed. A particular change may
be superseded at any time, either before leaving the originating node or at some later intermediate
destination. Since a change may be removed from the SS before all nodes have seen that change,
another method is used to permit a node to determine when the changes have been processed by remote
nodes. The timestamp array t is used for this purpose. This array is indexed by node number. The
value of each entry represents a node-relative Clock number. For example, if t_i[5] = 3, then node i
has seen the result of all changes from node 5 through time 3 (relative to node 5). We use the term
result here because changes can be superseded in the synchronization set at any time. Thus a node
may never see certain changes; it could instead see some newer change.
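
The "at most one entry per item name" rule can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the names Change and record_local_change are hypothetical, Clock values are taken to be integers, and the sketch bypasses the resolution table by simply overwriting the older entry, which is the effect supersession has for a purely local sequence of changes.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """One synchronization-set entry (hypothetical layout for illustration)."""
    name: str                                   # item name
    op: str                                     # "Insert", "Update" or "Delete"
    cn: int                                     # node that made the change
    ct: int                                     # node-relative Clock value of the change
    knownby: set = field(default_factory=set)   # nodes known to have seen it


def record_local_change(ss, t, node, clock, name, op):
    """Record a change made at `node`; a newer change supersedes the older entry."""
    ss[name] = Change(name, op, node, clock, {node})   # keyed by name: one entry per item
    t[node] = clock                                    # the node has seen itself through `clock`


ss, t = {}, {1: 0, 5: 3}     # t[5] = 3: results of node 5's changes seen through time 3
record_local_change(ss, t, node=1, clock=7, name="a", op="Insert")
record_local_change(ss, t, node=1, clock=9, name="a", op="Update")
```

After the two calls the set still holds a single entry for item "a" (the Update), so a remote node that receives it sees only the result of the two changes, never the superseded Insert.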

In the propagation approach a node i maintains an SS entry for every change entry applied to the
database locally until node i is sure every other node has received the change. When an SS arrives it is
merged with the local SS. Entries may be added to or deleted from both V and SS according to the
resolution table. Removal of a change entry from a node's SS can occur in one of two ways:

case 1:  A node can be passed an SS containing a change entry which has been seen by all nodes
         except for the receiving node.

case 2:  A node can receive an SS which does not contain the change item and the received t array
         shows that the sending node has seen a result from the node where the change
         originated at least through the time when the change occurred. Because the sending
         node definitely has seen the change and does not have the change entry on the SS, we
         know (by induction) that case 1 must have occurred at some node in the past. Thus the
         change entry at the receiving node can be deleted.

In the no propagation approach, the node performing the change is the only node which maintains the
change entry on the SS. The other nodes perform the change, but do not change their SS. An
originating node i can determine when an entry has been seen by all nodes when it receives t arrays
from all remote nodes which reflect a time for node i greater than the time when the change was
performed.

After node i receives and processes some remote SS_j, each entry in the local time array t_i is set to the
maximum of its current value and the corresponding entry of the remote time array received (t_j). In
essence, node i now has a view representing both nodes i and j through the times given in the new
timestamp array.
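
The entry-by-entry maximum can be sketched in Python as follows. This is a minimal illustration, assuming integer timestamps stored in dictionaries keyed by node number and treating a missing entry as time 0; the name merge_time_arrays is hypothetical.

```python
def merge_time_arrays(local_t, remote_t):
    """Return the entry-by-entry maximum of two timestamp arrays.

    Both arguments map node number -> latest node-relative time seen.
    A node missing from one array is treated as time 0 (an assumption).
    """
    nodes = local_t.keys() | remote_t.keys()
    return {n: max(local_t.get(n, 0), remote_t.get(n, 0)) for n in nodes}


t_i = {1: 4, 2: 1, 3: 0}
t_j = {1: 2, 2: 6, 3: 0}
merged = merge_time_arrays(t_i, t_j)
```

Here node i keeps its own newer knowledge of node 1 (time 4) and adopts node j's newer knowledge of node 2 (time 6).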

3.2  Details  Concerning  the  Base  Algorithm  and  Resolution  Tables 

In this algorithm, the Clock is assumed to provide real time. However, as discussed later, it is
possible to consider the Clock function as simply a strictly increasing function. Assuming
that the Clock function reflects time is particularly attractive since failures do not require special
corrective action to ensure the monotonicity property. This Clock property is stated below:


    C1.  Clock_{q+1} > Clock_q  for all executions q of the Clock function
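
A counter-based Clock that satisfies this property might look like the Python sketch below. It assumes the "strictly increasing function" reading rather than real time; the class NodeClock and its method names are hypothetical. The observe method mirrors the adjustment made at the end of the Receive procedure, which advances the local Clock past every time in a received time array.

```python
class NodeClock:
    """A strictly increasing, node-relative clock (illustrative only)."""

    def __init__(self, start=0):
        self._value = start

    def tick(self):
        """Each execution returns a value greater than the previous one."""
        self._value += 1
        return self._value

    def observe(self, remote_t):
        """Advance past the largest time in a received time array, if needed."""
        highest = max(remote_t.values(), default=0)
        if self._value <= highest:
            self._value = highest + 1
        return self._value


clock = NodeClock()
first = clock.tick()
second = clock.tick()
clock.observe({1: 10, 2: 4})     # a remote node has seen time 10 somewhere
after_receive = clock.tick()
```

A persistent counter of this kind would need to be restored after a failure to preserve the property, which is the corrective action the real-time assumption avoids.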


It is assumed for this presentation that item names satisfy restriction R2 by using unique names
generated by the Uniquename function. This, of course, is not required in applications in which
duplicate names are impossible.


The procedure Resolve processes new synchronization sets against local synchronization sets. Every
entry in each SS must be examined. If the other SS contains an entry with the same data item name,
then the two entries must be resolved and only one entry kept. Fake entries are created if only one of
the SS's contains a particular data item name change entry; see the following paragraph. The order of
processing each synchronization set is unimportant, but an entry must be processed only once.


Resolution tables are used to apply two synchronization sets against each other to update the
database view and create a new synchronization set which includes the most current information
concerning changes. The table format has been extended for simplicity to include rows and columns
to represent entries which may be present in one of the synchronization sets, but absent in the other.
This permits the resolution table to be used uniformly. There are two possible reasons why a
particular entry could be missing: the result of the change has already been seen and removed, or the
result of the change has never been seen. Each axis includes the lines AbsentSeen and
AbsentNotSeen to represent these conditions. Refer to the procedure Resolve and the type and
variable definitions on the following pages for the definition and use of changeitem.


AbsentSeen

    • there is no entry with the specified name in the associated
      synchronization set (absent from SS),

    • and by the associated t we know that changes have been processed
      through the value in the time array (seen by the node). That is:

          t[z.cn] ≥ z.ct, for some changeitem z

AbsentNotSeen

    • there is no entry with the specified name in the associated
      synchronization set (absent from SS),

    • and by the associated t we have definitively not seen the change
      (not seen by the node). That is:

          t[z.cn] < z.ct, for some changeitem z
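
The two conditions reduce to a single comparison against the time array, as the following Python sketch shows. It assumes integer times and a dictionary-based time array in which a missing node counts as time 0; the function name classify_absent is hypothetical.

```python
def classify_absent(t, cn, ct):
    """Classify an entry that is missing from a synchronization set.

    t  -- time array associated with the set that lacks the entry
    cn -- node that made the change
    ct -- node-relative time of the change
    """
    return "AbsentSeen" if t.get(cn, 0) >= ct else "AbsentNotSeen"


t = {5: 3}                             # results from node 5 seen through time 3
at_limit = classify_absent(t, 5, 3)    # change made at time 3: already seen
too_new = classify_absent(t, 5, 4)     # change made at time 4: not yet seen
```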


The procedure PerformAction used in Resolve is simply a dummy procedure which represents
performing the actions defined in the resolution table on both SS and V. Thus,

    PerformAction (x, y, SS, V)

represents using the x.op and y.op fields (of the changeitems x and y) to select the appropriate x axis
and y axis array positions in the table. The actions specified at that location are then to be performed
on SS and V. (Note that a dummy x or y entry is created if the item is missing from the corresponding
synchronization set; the op field is set accordingly.)


In all of the resolution tables presented below, Update conflicts are resolved in favor of the higher
change time (ct); that is, the latest change probably wins. It is possible that the latest change may
not win, though, because the different node clocks may not be physically synchronized. It is also
possible for two changes to be made at exactly the same time. This conflict is resolved by using a total
ordering on the node numbers. This algorithm does require that the Clock functions of the individual
nodes be logically synchronized [Lamp78]. This is accomplished within the Receive procedure.
Section 5 discusses alternative methods for conflict resolution.
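
The "higher change time wins, node number breaks ties" rule amounts to a lexicographic comparison of (ct, cn) pairs. The Python sketch below is an illustration under the assumption that change entries are dictionaries with integer "ct" and "cn" fields; the function name later_change is hypothetical.

```python
def later_change(x, y):
    """Return whichever change wins an Update conflict.

    The higher change time (ct) wins; if the times are equal, the higher
    node number (cn) wins, which gives a total order over distinct changes.
    If x and y are the same change, x is returned.
    """
    return max(x, y, key=lambda c: (c["ct"], c["cn"]))


x = {"cn": 1, "ct": 5, "val": "from node 1"}
y = {"cn": 2, "ct": 5, "val": "from node 2"}
winner = later_change(x, y)     # equal times, so node 2 wins the tie
```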


3.3  The  Base  Algorithm  and  the  First  Resolution  Table 


The base algorithm is presented on the following pages. The notation for the algorithm is based on a
derivative of Pascal. Some additional notation is used to avoid trivial details. The first resolution
table is presented in Figure 3-1. This table represents the independent / propagation type of system
structure. The other resolution tables are presented in Section 4.


There  are  a  variety  of  simple  modifications  possible  for  the  base  algorithm  separate  from  the 
resolution  table.  These  algorithm  modifications  are  discussed  in  Section  5.  Because  the 
modifications  are  so  simple  we  consider  all  the  possible  derived  algorithms  to  be  part  of  the  suite. 



Types and Variables

Primitive Types:

    string
    integer
    ts                     {the type returned from the Clock function}
    value                  {the type for the items in the database}

Defined Types:

    node       = 1..MaxNodes;

    itemname   = record
                     itemstring   : string;
                     creator      : node;
                     creationtime : ts
                 end;

    item       = record
                     itemn : itemname;
                     val   : value
                 end;

    changeitem = record
                     citem   : item;
                     op      : (Insert, Update, Delete, AbsentSeen,
                                AbsentNotSeen);
                     cn      : node;
                     ct      : ts;
                     knownby : set of node
                 end;

    tsarray    = array [node] of ts;

    changeset  = set of changeitem;

    message    = record
                     from     : node;
                     remoteT  : tsarray;
                     remoteSS : changeset
                 end;

Global Variables (for each node):

    V        : set of item;    {this node's view of the database}

    SS       : changeset;      {the synchronizing set}

    t        : tsarray;        {Time array; e.g., t[5] = 3 => this node has seen
                                the result of all changes from node 5 through
                                time 3 (relative to node 5)}

    allnodes : set of node;    {the current list of all nodes which can view the
                                database}

    i        : node;           {the node number of this node}


function Uniquename (xstring : string) : itemname
begin
    Uniquename := <xstring, i, Clock>
end;


function Insert (xname : itemname; vin : value) : (ok, alreadyexists)
var
    x     : item;
    time  : ts;
    r     : changeset;
    local : boolean;
begin
    if xname ∈ V.itemn then Insert := alreadyexists
    else begin
        time := Clock;
        x := <xname, vin>;
        t[i] := time;
        r := {<x, Insert, i, time, {i}>};
        local := true;
        Resolve (local, SS, t, r, t, V);
        Insert := ok
    end
end;


function Update (xname : itemname; vin : value) : (ok, nonexistent)
var
    x     : item;
    time  : ts;
    r     : changeset;
    local : boolean;
begin
    if xname ∉ V.itemn then Update := nonexistent
    else begin
        time := Clock;
        x := <xname, vin>;
        t[i] := time;
        r := {<x, Update, i, time, {i}>};
        local := true;
        Resolve (local, SS, t, r, t, V);
        Update := ok
    end
end;


function Delete (xname : itemname) : (ok, nonexistent)
var
    x     : item;
    time  : ts;
    junk  : value;
    r     : changeset;
    local : boolean;
begin
    if xname ∉ V.itemn then Delete := nonexistent
    else begin
        time := Clock;
        x := <xname, junk>;
        t[i] := time;
        r := {<x, Delete, i, time, {i}>};
        local := true;
        Resolve (local, SS, t, r, t, V);
        Delete := ok
    end
end;


function  List  (wanted:  set  of  itemname):  set  of  item 
begin 

return  from  V  the  wanted  items,  if  present 

end; 


procedure Transmit
begin
    Save (t, SS, V);      {save in permanent storage all changes since the last save.
                           This can be done by an incremental log.}

    Send (<i, t, SS>)     {Send the view information (represented by SS) in a message
                           to some set of other nodes}
end;


procedure Receive (m : message)
var
    local : boolean;
begin
    local := false;
    Resolve (local, SS, t, m.remoteSS, m.remoteT, V);
    if Clock ≤ max {m.remoteT[j] | 1 ≤ j ≤ MaxNodes} then
        Clock := max {m.remoteT[j] | 1 ≤ j ≤ MaxNodes} + 1
end;



procedure Resolve (local     : boolean;
                   var oldSS : changeset;    var oldt : tsarray;
                   var newSS : changeset;    newt     : tsarray;
                   var V     : set of item)
var
    tempoldSS  : changeset;
    matchitems : changeset;
    x          : changeitem;
    y          : changeitem;
begin
    tempoldSS := oldSS;
        {note: In practice oldSS does not need to be copied. It is shown this
         way for clarity}

    if not local then begin
        for each x ∈ tempoldSS do
        begin
            matchitems := {z ∈ newSS | x.citem.itemn = z.citem.itemn};
            newSS := newSS - matchitems;
            if matchitems = ∅ then begin    {make a dummy changeitem entry}
                if newt[x.cn] ≥ x.ct then y.op := AbsentSeen
                else y.op := AbsentNotSeen
            end
            else let y ∈ matchitems;        {note: |matchitems| = 1}
            PerformAction (x, y, oldSS, V)
        end
    end;

    for each y ∈ newSS do
    begin
        matchitems := {z ∈ tempoldSS | y.citem.itemn = z.citem.itemn};
        if matchitems = ∅ then begin        {make a dummy changeitem entry}
            if oldt[y.cn] ≥ y.ct then x.op := AbsentSeen
            else x.op := AbsentNotSeen
        end
        else let x ∈ matchitems;            {note: |matchitems| = 1}
        PerformAction (x, y, oldSS, V)
    end;

    for each j, 1 ≤ j ≤ MaxNodes, do
        oldt[j] := max {oldt[j], newt[j]}
end;
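
The control flow of Resolve can be followed more easily in a compact Python sketch. This is an illustration rather than the algorithm's definitive form: synchronization sets are assumed to be dictionaries keyed by item name, time arrays are dictionaries keyed by node number with missing entries read as 0, and perform_action is a caller-supplied callback standing in for PerformAction and the resolution table. All Python names are hypothetical.

```python
def resolve(local, old_ss, old_t, new_ss, new_t, perform_action):
    """Pair up entries of two synchronization sets and hand each pair to
    perform_action(x, y), creating a dummy entry for a missing side."""
    snapshot = dict(old_ss)      # tempoldSS: a stable copy of the local set
    remaining = dict(new_ss)     # incoming entries not yet paired

    if not local:
        # Pass 1: every local entry against its incoming match, if any.
        for name, x in snapshot.items():
            y = remaining.pop(name, None)
            if y is None:
                seen = new_t.get(x["cn"], 0) >= x["ct"]
                y = {"name": name, "op": "AbsentSeen" if seen else "AbsentNotSeen"}
            perform_action(x, y)

    # Pass 2: incoming entries still unpaired (all of them when local).
    for name, y in remaining.items():
        x = snapshot.get(name)
        if x is None:
            seen = old_t.get(y["cn"], 0) >= y["ct"]
            x = {"name": name, "op": "AbsentSeen" if seen else "AbsentNotSeen"}
        perform_action(x, y)

    # Finally, fold the incoming time array into the local one.
    for node, ts in new_t.items():
        old_t[node] = max(old_t.get(node, 0), ts)
```

Because the first pass removes each matched entry from the incoming set, no item name is processed twice, which is the "processed only once" requirement stated in Section 3.2.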




                                  x.op (entry from oldSS)
y.op (entry
from newSS)     Insert        Update        Delete        AbsentSeen    AbsentNotSeen
--------------  ------------  ------------  ------------  ------------  --------------------
Insert          SS: M(x,y)    SS: nc        SS: nc        SS: nc        SS: A(y)
                V:  nc        V:  nc        V:  nc        V:  nc        V:  A(y)

Update          SS: R(x,y)    SS: *1        SS: nc        SS: nc        SS: A(y)
                V:  R(y)      V:  *2        V:  nc        V:  nc        V:  A(y) if absent
                                                                            R(y) if present

Delete          SS: R(x,y)    SS: R(x,y)    SS: M(x,y)    SS: nc        SS: A(y)
                V:  D(x)      V:  D(x)      V:  nc        V:  nc        V:  D(y) if present

AbsentSeen      SS: D(x)      SS: D(x)      SS: D(x)
                V:  nc        V:  nc        V:  nc

AbsentNotSeen   SS: nc        SS: nc        SS: nc
                V:  nc        V:  nc        V:  nc

nc:  no change

*1:  if (x.cn = y.cn) and (x.ct = y.ct) then M(x,y)
     else if (x.ct < y.ct) or (x.ct = y.ct and x.cn < y.cn) then R(x,y)

*2:  R(y) if x replaced in *1

SS:  A(j)    = Add j to oldSS
               union local node number into knownby of j
               if knownby = allnodes, then D(j)

     R(j,k)  = Replace j on oldSS with k
               union local node number into knownby of j
               if knownby = allnodes, then D(j)

     D(j)    = Delete j on oldSS

     M(j,k)  = Merge knownby sets into j
               if knownby = allnodes, then D(j)

V:   A(j)    = Add j.citem to V

     R(j)    = Replace the item with the same name in V with j.citem

     D(j)    = Delete item with the name j.citem.itemn from V

Figure 3-1  Resolution Table for Propagation / Independent
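
The SS actions in the legend of Figure 3-1 all manipulate the knownby set and delete an entry once every node is known to have seen it. The Python sketch below illustrates this bookkeeping under several assumptions: the synchronization set is a dictionary keyed by item name, entries are dictionaries with a "knownby" set, and after R(j,k) the knownby update is applied to the entry now stored under that name. The function names are hypothetical.

```python
def _drop_if_complete(ss, name, allnodes):
    """D(j) once knownby = allnodes: every node has seen the change."""
    if ss[name]["knownby"] == allnodes:
        del ss[name]


def ss_add(ss, j, me, allnodes):
    """A(j): add j to the SS and union the local node number into knownby."""
    ss[j["name"]] = {**j, "knownby": set(j["knownby"]) | {me}}
    _drop_if_complete(ss, j["name"], allnodes)


def ss_replace(ss, j, k, me, allnodes):
    """R(j,k): replace j on the SS with k, then record the local node."""
    ss.pop(j["name"], None)
    ss_add(ss, k, me, allnodes)


def ss_delete(ss, j):
    """D(j): delete j from the SS."""
    ss.pop(j["name"], None)


def ss_merge(ss, j, k, allnodes):
    """M(j,k): merge the knownby set of k into the stored entry j."""
    ss[j["name"]]["knownby"] |= set(k["knownby"])
    _drop_if_complete(ss, j["name"], allnodes)
```

With three nodes, an entry first known by nodes 1 and 2 is retained; merging in a copy known by node 3 completes the set and removes the entry, which corresponds to removal case 1 in Section 3.1.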

4.  Other Resolution Tables


The resolution table presented with the base algorithm (Figure 3-1) assumes that every node should
propagate a change and that items can be changed anywhere throughout the network (i.e., the
database is independent). There are a variety of other assumptions which can be supported by simply
altering the resolution table provided with the base algorithm.

First, we consider the situation where the database is dependent. Recall that in this type of
database changes can occur to a data item only at the node which originally created the item. In
addition, we assume that other nodes are not responsible for ensuring the changes are seen by every
other node; the node making the change is responsible for verifying this (viz., no propagation). This
problem is somewhat trivial, but nevertheless quite common. Consider operating system load tables
which specify the current load information for the node. This information is then used by other
nodes in some decentralized load distribution procedure. Clearly, only one node will be changing the
load information, and if the changing node fails there is little reason for concern over change
propagation. Figure 4-1 contains the resolution table for this problem. Note that there are no Update
conflicts in this example, so logical Clock synchronization is actually not needed. If the database
were dependent, but propagation was desired, then the resolution table of Figure 3-1 would be used.

The second alternative resolution table we consider represents another common problem: even
though changes can occur at any node (independent), propagation of this information by every node in
the network is not required (no propagation). This may be reasonable in environments such as high
speed contention or ring based local area networks in which the nodes appear fully connected. Thus
the originating node for a change is responsible for ensuring that every other node learns of the
change. Figure 4-2 contains the resolution table which specifies the actions to be taken in this
environment.

The following table summarizes the requirements satisfied by the different resolution tables.

    requirements       dependent      independent
    ---------------    -----------    -----------
    propagation        Figure 3-1     Figure 3-1
    no propagation     Figure 4-1     Figure 4-2
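
Because the base algorithm is fixed and only the table varies, selecting a member of the suite is a simple lookup. The Python sketch below restates the summary table; the dictionary and function names are hypothetical, and the string keys are chosen only for illustration.

```python
# (database type, fault-tolerance level) -> resolution table to install
RESOLUTION_TABLE = {
    ("dependent", "propagation"): "Figure 3-1",
    ("independent", "propagation"): "Figure 3-1",
    ("dependent", "no propagation"): "Figure 4-1",
    ("independent", "no propagation"): "Figure 4-2",
}


def table_for(database_type, fault_tolerance):
    """Return the figure holding the resolution table for a configuration."""
    return RESOLUTION_TABLE[(database_type, fault_tolerance)]


load_tables = table_for("dependent", "no propagation")   # the load-table example
```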

                                  x.op (entry from oldSS)
y.op (entry
from newSS)     Insert        Update        Delete        AbsentSeen    AbsentNotSeen
--------------  ------------  ------------  ------------  ------------  --------------------
Insert          SS: -         SS: -         SS: -         SS: nc        SS: A(y) if y.cn = i
                V:  -         V:  -         V:  -         V:  nc        V:  A(y)

Update          SS: R(x,y)    SS: R(x,y)    SS: -         SS: nc        SS: A(y) if y.cn = i
                V:  R(y)      V:  R(y)      V:  -         V:  nc        V:  A(y) if absent
                                                                            R(y) if present

Delete          SS: R(x,y)    SS: R(x,y)    SS: -         SS: nc        SS: A(y) if y.cn = i
                V:  D(x)      V:  D(x)      V:  -         V:  nc        V:  D(y) if present

AbsentSeen      SS: K(x)      SS: K(x)      SS: K(x)      SS: -         SS: -
                V:  nc        V:  nc        V:  nc        V:  -         V:  -

AbsentNotSeen   SS: nc        SS: nc        SS: nc        SS: -         SS: -
                V:  nc        V:  nc        V:  nc        V:  -         V:  -

nc:  no change

SS:  A(j)    = Add j to oldSS
               union local node number into knownby of j
               if knownby = allnodes, then D(j)

     R(j,k)  = Replace j on oldSS with k
               union local node number into knownby of j
               if knownby = allnodes, then D(j)

     D(j)    = Delete j on oldSS

     K(j)    = Union remote node number into knownby of j
               if knownby = allnodes, then D(j)

V:   A(j)    = Add j.citem to V

     R(j)    = Replace the item with the same name in V with j.citem

     D(j)    = Delete item with the name j.citem.itemn from V

Figure 4-1  Resolution Table for No Propagation / Dependent

                                  x.op (entry from oldSS)
y.op (entry
from newSS)     Insert        Update        Delete        AbsentSeen    AbsentNotSeen
--------------  ------------  ------------  ------------  ------------  --------------------
Insert          SS: -         SS: nc        SS: nc        SS: nc        SS: A(y) if y.cn = i
                V:  -         V:  nc        V:  nc        V:  nc        V:  A(y)

Update          SS: S(x,y)    SS: *1        SS: nc        SS: nc        SS: A(y) if y.cn = i
                V:  R(y)      V:  *2        V:  nc        V:  nc        V:  A(y) if absent
                                                                            R(y) if present

Delete          SS: S(x,y)    SS: S(x,y)    SS: nc        SS: nc        SS: A(y) if y.cn = i
                V:  D(x)      V:  D(x)      V:  nc        V:  nc        V:  D(y) if present

AbsentSeen      SS: K(x)      SS: K(x)      SS: K(x)      SS: -         SS: -
                V:  nc        V:  nc        V:  nc        V:  -         V:  -

AbsentNotSeen   SS: nc        SS: nc        SS: nc        SS: -         SS: -
                V:  nc        V:  nc        V:  nc        V:  -         V:  -

nc:  no change

*1:  if (x.cn = y.cn) and (x.ct = y.ct) then R(x,y)
     else if (x.ct < y.ct) or (x.ct = y.ct and x.cn < y.cn) then D(x)

*2:  R(y) if x replaced in *1

SS:  A(j)    = Add j to oldSS
               union local node number into knownby of j
               if knownby = allnodes, then D(j)

     R(j,k)  = Replace j on oldSS with k
               union local node number into knownby of j
               if knownby = allnodes, then D(j)

     D(j)    = Delete j on oldSS

     S(j,k)  = if j.cn = k.cn and j.ct = k.ct then R(j,k)
               else D(j)

     K(j)    = Union remote node number into knownby of j
               if knownby = allnodes, then D(j)

V:   A(j)    = Add j.citem to V

     R(j)    = Replace the item with the same name in V with j.citem

     D(j)    = Delete item with the name j.citem.itemn from V

Figure 4-2  Resolution Table for No Propagation / Independent

5.  Variations


The following sections discuss extensions of the base algorithm and resolution table. The particular
variations presented adapt the basic scheme to accommodate a variety of different application
requirements.

5.1  Sending  Individual  Changes  Immediately 

When the synchronization set is sent from a node, all changes which may not have been seen by some
other node are sent. Because this set may only be sent occasionally by some applications, it is
desirable to consider the possibility of sending a change immediately (without the remaining
members of the synchronization set). Whether this is appropriate depends on several factors. If
changes are rapid, a substantial load on the network could result. This is possible because, in the
base algorithm, changes are overwritten in the synchronization set as soon as they are detected. If
changes are sent immediately, then some changes could be sent which would not have been sent in
the base algorithm. There are, however, a variety of applications which could benefit from the rapid
distribution of a change. The synchronization set would still be transmitted as a backup precaution to
ensure that all changes are eventually acknowledged.

The same resolution approach can be used to solve this problem. However, care must be taken
because each change sent is independent from the preceding one. The receiving node cannot
determine whether all preceding changes have been seen or not. That is, receiving a particular
change from some node j does not imply reception of all previous changes from node j. Therefore,
when the change is received the new change should be resolved against any changes of the same
name in the local SS, but the local SS entries should not be resolved against absent entries in the
incoming SS. This is exactly what is required when performing changes locally at a node, and thus
the local variable is set to true.

Below  are  the  code  fragments  to  accomplish  sending  changes  immediately.  Another  message  type  is 
defined  which  is  sent  for  every  Insert,  Update,  or  Delete  performed.  We  will  assume  that  the  two 
message  types  can  be  distinguished. 


•  In Insert, Update and Delete, immediately before returning ok:

      Set r to have a knownby list of ∅;
      Save (t, SS, V);      {Incremental save}
      Send (< i, r >);      {Special send of single change}

•  Add a new Receive operation:

      procedure Receives (m: record from : node; remoteSS : changeset; end);
      var
         local : boolean;
      begin
         local := true;      {pretend it's local}
         Resolve (local, SS, t, m.remoteSS, t, V)
      end;


5.2 Specifying Conflict Strategies for Ordering Update Operations

All of the resolution tables presented thus far have considered only retaining the most recent change.
As mentioned previously, this is not always achieved because the clocks may not be physically
synchronized. (However, in environments such as local area networks, the clocks are usually very
close.) Even if the change which would be retained is the most recent, this may be
inappropriate for some applications. That is, sometimes it may be desirable to choose the earlier
change rather than the later one. For example, if two clients conflict when changing an item in a
reservation database, it is the earlier change which should probably win.

Using  just  older  and  newer  as  the  only  conflict  resolution  strategies  is  still  overly  restrictive.  There 
are  several  other  functions  which  could  be  used  to  resolve  conflicts.  The  functions  Maximum  and 
Minimum,  for  example,  appear  quite  well  suited  for  conflict  resolution  for  some  application  data 
items  in  reservation  and  similar  systems.  Any  (commutative)  function  which  totally  orders  the  data 
values  will  suffice.  Note  that  in  the  base  resolution  table,  node  numbers  were  totally  ordered  and 
used  to  break  Update  conflicts  which  tied  on  their  change  times.  This  was  used  because  each  node  has 
a  separate  execution  agent  and  thus  could  not  create  the  needed  total  order. 

A trivial extension to address different conflict resolution strategies for each type of data item is to
include with each item (when Inserted) the type of conflict resolution strategy which should be
performed on Update conflicts.
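
The per-item strategy described above can be sketched as a small lookup table. This is only an illustrative sketch in Python, not the paper's resolution table: the names `Change`, `STRATEGIES`, and `resolve` are hypothetical, and ties are assumed to be broken by Clock time and then node number, as in the base table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Change:
    value: int   # the data value carried by the Update
    time: int    # Clock value at the originating node
    node: int    # originating node number, used only to break ties

# Each strategy picks the winner of two conflicting Updates.
# Every key is a total order, so the result does not depend on argument order.
STRATEGIES = {
    "newest":  lambda a, b: max(a, b, key=lambda c: (c.time, c.node)),
    "oldest":  lambda a, b: min(a, b, key=lambda c: (c.time, c.node)),
    "maximum": lambda a, b: max(a, b, key=lambda c: (c.value, c.time, c.node)),
    "minimum": lambda a, b: min(a, b, key=lambda c: (c.value, c.time, c.node)),
}

def resolve(strategy: str, a: Change, b: Change) -> Change:
    """Return the surviving change under the strategy recorded with the item."""
    return STRATEGIES[strategy](a, b)

a = Change(value=10, time=5, node=1)
b = Change(value=7, time=9, node=2)
winners = {name: resolve(name, a, b) for name in STRATEGIES}
```

With these two sample changes, "newest" and "minimum" keep `b`, while "oldest" and "maximum" keep `a`; a reservation system would presumably record "oldest" with its items at Insert time.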

5.3  Functional  Operations 

Update operations replace the value of a data item in the view. This prompts the Update/Update
conflicts which must be resolved through some type of total ordering on the changes. There are a
variety of operations which do not have this inherent conflict problem. For example, the
commutative operations of Increment and Decrement cannot conflict since the result would be the
same  regardless  of  the  order  executed.  Thus,  items  in  the  database  could  be  marked  as  being 
manipulated  only  through  some  specified  set  of  functional  operations  and  avoid  all  conflicts.  The 
changes  to  the  resolution  table  would  be  quite  simple.  One  new  column  and  one  new  row  must  be 
added  for  functional  operations.  Instead  of  replacing  entries  on  the  synchronization  set,  functional 
changes  must  add  new  entries.  As  the  entries  are  verified  to  have  been  seen  by  all  nodes,  the  entries 
are  deleted  as  before.  It  is  assumed  that  data  items  which  use  functional  operations  can  not  be 
manipulated  through  the  Update  operation.  If  a  Delete  operation  is  performed,  then  all  functional 
entries  on  the  synchronization  set  should  be  removed.  Thus,  an  Insert  can  be  performed  followed  by 
any  number  of  functional  operations  and  finally  followed  by  a  Delete  operation.  The  modifications  to 
the  resolution  table  are  straightforward  and  are  not  shown  here. 
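
The order-independence argument for functional operations can be checked with a short sketch. It is a simplified illustration under stated assumptions: `FunctionalEntries` is a hypothetical stand-in for the synchronization set, storing only signed deltas per item name, and it omits the knownby bookkeeping that would eventually discard acknowledged entries.

```python
from itertools import permutations

class FunctionalEntries:
    """Hypothetical stand-in for the functional entries of a synchronization set."""

    def __init__(self):
        self.pending = {}

    def record(self, name, delta):
        # Functional changes add a new entry instead of overwriting the old one.
        self.pending.setdefault(name, []).append(delta)

    def delete(self, name):
        # A Delete removes every functional entry recorded for the item.
        self.pending.pop(name, None)

def apply_in_order(initial, deltas):
    """Apply Increment/Decrement deltas one at a time, in the order given."""
    value = initial
    for delta in deltas:
        value += delta
    return value

entries = FunctionalEntries()
for delta in (+3, -1, +5):
    entries.record("seats", delta)

# Every arrival order of the same entries yields the same final value.
outcomes = {apply_in_order(10, order)
            for order in permutations(entries.pending["seats"])}
```

Because addition is commutative, `outcomes` holds a single value no matter which of the six arrival orders a node happens to see, which is why no conflict resolution is needed for such items.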

5.4  Atomic  Changes 

The atomic operations (which change the database) presented thus far are the primitives Insert,
Update, and Delete. If these operations were combined into a larger transaction, then the
transaction would not maintain the same properties as the smaller operations. Since each change to a
view  receives  a  Clock  timestamp,  it  is  not  possible  to  ensure  that  multiple  changes  will  be  treated 
uniformly  with  respect  to  conflicts.  What  may  be  desired  in  certain  cases  is  that  multiple  changes 
either  all  win,  or  all  lose  in  a  conflict.  One  alternative  is  to  assign  the  same  Clock  time  to  every 
change  in  the  transaction.  This  guarantees  that  if  two  transactions  containing  only  Update 
operations  manipulate  the  same  items,  then  the  transactions  can  be  serially  ordered. 
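
A minimal sketch of the shared-timestamp idea follows. It assumes a "most recent wins" rule on (Clock time, node number) stamps and Update-only transactions; `run_transaction` and `merge` are hypothetical helper names, and views are modelled as plain dictionaries.

```python
def run_transaction(clock, node, writes):
    """Stamp every Update in the transaction with one shared (time, node) stamp."""
    stamp = (clock, node)
    return {name: (value, stamp) for name, value in writes.items()}

def merge(view, changes):
    """Apply changes to a view, keeping the entry with the later stamp."""
    for name, (value, stamp) in changes.items():
        if name not in view or stamp > view[name][1]:
            view[name] = (value, stamp)
    return view

t1 = run_transaction(clock=5, node=1, writes={"a": 1, "b": 1})
t2 = run_transaction(clock=7, node=2, writes={"a": 2, "b": 2})

# The two transactions touch the same items and arrive in different orders.
view_1 = merge(merge({}, t1), t2)
view_2 = merge(merge({}, t2), t1)
```

Both views end up holding only the values written by `t2`: since all of a transaction's changes carry one stamp, they win or lose together, which corresponds to a serial order of `t1` followed by `t2`.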
5.5 Limiting the Size of Synchronization Sets

Changes remain on the synchronization sets of all nodes responsible for information propagation
until all nodes have acknowledged the change. During node and network failures the sets could
become quite large. This is the cost of not passing the entire view around the network. If data items
are not deleted, then the size of each synchronization set is bounded by the size of the view. There
could be a change for every data item in the view, but since changes are overwritten if an entry
already exists, the set cannot grow beyond this bound regardless of failure duration. If, however,
Delete operations occur, then the simplistic scheme presented thus far would allow the
synchronization set to become unbounded. There appear to be two straightforward solutions to this
problem. Each of these is discussed below.

First, the SS could be limited to contain only $n$ members, with each node $i$ owning $O_i$ members. The
nodes could be assigned different amounts, provided that each SS has sufficient space for all the
entries. That is,

   $n = \sum_{i=1}^{MaxNodes} O_i$

If a local client makes a request of the system and its node's allocation on the SS is depleted, then no
Inserts or Deletes should be accepted. Updates can be accepted only if the item is already in the
node's SS. This allows all remote node synchronization sets to be accepted. This is of course a
pessimistic strategy: the entire system could stop accepting Inserts and Deletes if a single node fails.
However, in the case of a simple node failure, it is relatively simple to eliminate the failed node from
Allnodes and demand that the failed node reinitialize when it restarts (see Section 5.6). It is much
more complicated if the network communication system has failed and the network is partitioned. The
second alternative could be used in cases where this solution is unacceptable.
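
The admission rule of the first solution can be written as a single predicate. This is a hedged sketch: `accept`, `own_entries`, and `allocation` are hypothetical names, and the node's share of the SS is modelled simply as the set of item names for which it currently holds unacknowledged entries.

```python
def accept(op, name, own_entries, allocation):
    """Decide whether a local request may be accepted.

    op          -- "Insert", "Update", or "Delete"
    name        -- name of the data item the request refers to
    own_entries -- item names this node already holds entries for in the SS
    allocation  -- number of SS members this node owns
    """
    if len(own_entries) < allocation:
        return True                  # space remains, any operation is fine
    # Allocation depleted: only an Update that overwrites an existing entry
    # can be accepted, because it does not enlarge the SS.
    return op == "Update" and name in own_entries

full = {"a", "b"}
decisions = {
    "insert_when_full": accept("Insert", "x", full, 2),
    "delete_when_full": accept("Delete", "a", full, 2),
    "update_existing":  accept("Update", "a", full, 2),
    "update_new_item":  accept("Update", "c", full, 2),
    "insert_with_room": accept("Insert", "x", {"a"}, 2),
}
```

Only the Update of an item already on the node's SS, and any request made while space remains, are accepted.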

The second solution involves replacing the Delete entries on the synchronization set with a
DeleteRange entry. Two Delete entries (related to the same node responsible for some change) can be
combined if the view contains no intervening view items created by the same node. This is true even
if the node which is creating the DeleteRange entry did not delete all intervening items in the view.
When a DeleteRange entry is received, it can be expanded to match all items in the range. The test for
intervening is made on the ctime field. For example, the following entry would delete all items
created by node number 5 from time 1 through time 8.

   DeleteRange (creator = 5, ctime = 1..8, cn = 4)

It can be proved that this is sufficient to bound (within a proportionality constant) the SS size to the
size of the view.
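
The combining test and the expansion of a DeleteRange entry can be sketched as follows. The record layouts `Item` and `DeleteRange`, and the helpers `can_combine` and `expand`, are hypothetical; each view item is assumed to carry its creating node and creation time (the ctime field), and the cn field of the entry is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    name: str
    creator: int   # node that created the item
    ctime: int     # creation time assigned by the creator

@dataclass(frozen=True)
class DeleteRange:
    creator: int
    first: int     # lowest ctime covered, inclusive
    last: int      # highest ctime covered, inclusive

def can_combine(view, creator, t1, t2):
    """Two Deletes may merge if no surviving item by `creator` lies between them."""
    lo, hi = sorted((t1, t2))
    return not any(i.creator == creator and lo < i.ctime < hi for i in view)

def expand(view, entry):
    """Apply a received DeleteRange: drop every matching item from the view."""
    return [i for i in view
            if not (i.creator == entry.creator
                    and entry.first <= i.ctime <= entry.last)]

view = [Item("a", 5, 1), Item("b", 5, 4), Item("c", 5, 8),
        Item("d", 3, 4), Item("e", 5, 9)]
survivors = expand(view, DeleteRange(creator=5, first=1, last=8))
```

Here the entry from the example in the text removes items `a`, `b`, and `c`; item `d` survives because another node created it, and `e` survives because its ctime lies outside the range.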

5.6 Online Inclusion / Removal of Nodes

Even though the suite supports "automatic" reintegration of nodes supporting the database in most
cases, throughout the life of the database certain nodes may fail beyond automatic database repair
(e.g., disk crash). In addition, nodes may be added to or removed from the set supporting the
database. These situations require a means for a node to resynchronize with the current members of
Allnodes. We will refer to this situation as cold starting.

Since Allnodes can change over the life of the database, there is no reason not to place Allnodes
directly into the database. The value then propagates naturally throughout the network when a
change is issued. Removing a machine from the participating group of database nodes is
straightforward. However, adding a new member requires an agreement procedure which is quite
similar to that of two-phase commitment [Dole82, Gray78]. A sketch of the procedures is given below.

•  Removing a Node i:

      Allnodes := Allnodes - {i}
      Update Allnodes

•  Cold Starting a Node i:

      V := ∅; t[k] := 0 ∀k; SS := ∅; Allnodes := {i}; Stillbooting := true; Talkingnodes := ∅;

   ►  Send request for boot service to everyone; pass node number i
      All receiving nodes with Stillbooting = false should send node i their node numbers
      For all nodes which respond {before timeout limit}, add their node numbers to Talkingnodes
      While (Talkingnodes ≠ ∅) and (Stillbooting) do
      begin
         Pick some node, say j, from Talkingnodes {picked as desired; e.g., closest}
         Talkingnodes := Talkingnodes - {j}
         Send request for V, SS, and t to node j
         If this message is received by j, then j must add i to Allnodes before replying
         If i receives j's view information {before timeout limit} then
         begin
            create a new SS by making an Insert entry for every item in the database
            Merge the returned remoteSS into SS (through standard resolution)
            t[k] := remoteT[k], ∀k ≠ i
            Stillbooting := false
         end
      end
      if Stillbooting = true then occasionally request boot service (by repeating above from ►)

•  Cold Starting the First Node i:

      V := ∅; t[k] := 0 ∀k; SS := ∅; Allnodes := {i}; Stillbooting := false;

If no nodes respond, then the node is free to continue; however, it must occasionally attempt to
communicate with other nodes which may be supporting the database. Note that this procedure
allows nodes to join the current group of view-communicating nodes. It does not allow two nodes
which are both Stillbooting to share information. This is allowed only after each has joined
the primary group.
6. A Formal Model of the Base Problem

In this section we provide a formal framework for considering the algorithms presented in this paper.
In addition, the correctness condition for the base algorithm and resolution table is given.

Let $W$ be the domain of values. Let $D$ be the domain of element names. Each view of the database
then is a subset of $D \times W$.

Let $BasicOps = \{Insert(x,y),\ Update(x,y),\ Delete(x) \mid x \in D,\ y \in W\}$. Let
$OtherOps = \{List(Q) \mid Q \subseteq D\} \cup \{Transmit(m),\ Receive(m) \mid m \text{ is a message}\}$.
And finally, let $AllOps = BasicOps \cup OtherOps$.

Fix some particular execution of the system. Each instance of an operation from AllOps corresponds
to an event. Let $E$ be the set of all events occurring in the particular fixed execution.

Let $op : E \to AllOps$ be the operation associated with each event, $where : E \to node$ be the node at
which the event occurred, and $when : E \to ts$ be when the event happened relative to the Clock at the
node where the event occurred. That is, $when(e) = Clock_{where(e)}$.

Define $\to$ to be a relation on $E \times E$, "happened before", such that

O1. if $e_1, e_2 \in E$, $where(e_1) = where(e_2)$, and $op(e_1)$ is performed before $op(e_2)$,
    then $e_1 \to e_2$.

O2. if $e_1, e_2 \in E$, $op(e_1) = Transmit(m_1)$ and $op(e_2) = Receive(m_1)$, then $e_1 \to e_2$.

O3. if $e_1, e_2, e_3 \in E$, $e_1 \to e_2$ and $e_2 \to e_3$, then $e_1 \to e_3$.

O4. if $e \in E$, then $e \to e$.

We can now define the correctness conditions for the base algorithm and resolution table. Recall that
this approach allows any node to make changes to the database, and each node is responsible for
ensuring that every other node has seen each change. Let $view : E \to 2^{(D \times W)}$ be defined as
follows: $(x,y) \in view(e')$ iff there exists $\varepsilon \in E$ such that

V1. $\varepsilon \to e'$ and $op(\varepsilon) \in \{Insert(x,y),\ Update(x,y)\}$

V2. $(\forall e)\,[Removed(e,\varepsilon) \Rightarrow \neg(e \to e')]$

where $Removed(e,\varepsilon) \equiv [op(e) = Delete(x)]$ or
      $[op(e) = Update(x,y')$ and $op(\varepsilon) = Insert(x,y)$ and $y \ne y']$ or
      $[op(e) = Update(x,y')$ and $op(\varepsilon) = Update(x,y)$ and $Earlier(\varepsilon,e)$ and $y \ne y']$

and $Earlier(e_1,e_2) \equiv [when(e_1) < when(e_2)]$ or
      $[when(e_1) = when(e_2)$ and $where(e_1) < where(e_2)]$
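
The two predicates Earlier and Removed can be transcribed almost directly into code. The sketch below is an illustration under stated assumptions: the `Event` record is hypothetical, events are assumed to carry their `when` and `where` values, and the requirement that both events refer to the same element name is made explicit even though the definition leaves it implicit in the shared variable x.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Event:
    kind: str               # "Insert", "Update", or "Delete"
    name: str               # element name x
    value: Optional[str]    # value y (None for Delete)
    when: int               # Clock value at the node of occurrence
    where: int              # node number

def earlier(e1: Event, e2: Event) -> bool:
    """Earlier(e1, e2): smaller time, or equal time and smaller node number."""
    return (e1.when, e1.where) < (e2.when, e2.where)

def removed(e: Event, eps: Event) -> bool:
    """Removed(e, eps): event e removes the pair (x, y) placed by event eps."""
    if e.name != eps.name:          # assumption: both events concern the same x
        return False
    if e.kind == "Delete":
        return True
    if e.kind == "Update" and e.value != eps.value:
        if eps.kind == "Insert":
            return True
        if eps.kind == "Update":
            return earlier(eps, e)  # only a later Update displaces an Update
    return False

ins = Event("Insert", "x", "a", when=1, where=1)
up1 = Event("Update", "x", "b", when=5, where=1)
up2 = Event("Update", "x", "c", when=5, where=2)
```

The two Updates tie on their change times, so the node number decides: `up1` is earlier than `up2`, hence `up2` removes `up1` but not the reverse.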
To prove the base algorithm and resolution table correct we must show that for all $e'$, $view(e') = V_{e'}$.
That is, the formal view and the database must contain the same set of $(x,y)$ after every event $e'$. The
following additional notation is required for the proofs.

$V_e$ (and ${}_eV$) ::= $V$ at $where(e)$ immediately after (respectively before) completing event $e$

$t_e$ (and ${}_et$) ::= $t$ at $where(e)$ immediately after (respectively before) completing event $e$

$SS_e$ (and ${}_eSS$) ::= $SS$ at $where(e)$ immediately after (respectively before) completing event $e$

Because SS's contain representations of events, we will refer to SS's as if they actually contain events.
Of course, only events from BasicOps have such representations. Thus, $e \in SS$ implies that
$op(e) \in BasicOps$. It is obvious from the program code that the program variables cn and ct for some
changeitem contain the values of the functions where and when for the event associated with the
particular changeitem. For notational convenience, we will therefore consider where and when to be
stored with each event which is in an SS.

We will not consider any operation which is rejected because of an error to be an event. Thus, $R_1$ and
$R_2$ are assumed to hold.

In the proofs, minimality is referenced. Event $e$ is minimal to event $e'$ with respect to some condition
B iff $e \to e'$ with condition B holding after event $e$ and there does not exist an event $d \ne e$ such that
$d \to e$ and condition B holds after event $d$. Thus minimality corresponds to the concept of earliest.
Lemma 1

   if $e \to e'$, then

   (a) $t_e(i) \le t_{e'}(i)$

   (b) ${}_et(i) \le {}_{e'}t(i)$ if $where(e) = where(e')$

   (c) $t_e(i) \le {}_{e'}t(i)$ if $where(e) = where(e')$.

Proof

   By inspection and the Clock property $C_1$. $\nabla$

Lemma 2

   if $(x,y) \in V_{e'}$, then there exists $\varepsilon \in E$ such that $\varepsilon$ is the event which placed $(x,y)$
   into $V_{e'}$, $op(\varepsilon) \in \{Insert(x,y),\ Update(x,y)\}$, and $\varepsilon \to e'$.

Proof

   By inspection of the resolution table (in particular axis y: Insert, Update; axis X: all) and
   induction on $\to$ with initially $V = \emptyset$, it is clear that there may be several events which
   precede $e'$ and which could have placed $(x,y)$ into $V_{e'}$. Obviously, only one of these events
   actually placed $(x,y)$ into $V_{e'}$. Let $\varepsilon$ be that particular event. $\nabla$


Lemma 3

   if $op(e) \in BasicOps$, $op(e') \in BasicOps$, $e \to e'$, and $e \ne e'$, then $when(e) < when(e')$.

Proof

   (1) if $where(e) = where(e')$, then by $C_1$: $when(e) < when(e')$

   (2) if $where(e) \ne where(e')$, then it must be the case that
       $e \to e'' \to e''' \to e'$ with $where(e) = where(e'')$ and $where(e''') = where(e')$ and
       $op(e'') = Transmit(m_1)$, and $op(e''') = Receive(m_1)$

   (3) Thus by the code in Receive, we ensure that $when(e) < when(e')$. $\nabla$

Lemma 4

   if $op(\varepsilon) \in \{Insert(x,y),\ Update(x,y)\}$ then $(x,y) \in V_\varepsilon$.

Proof

   (1) if $op(\varepsilon) = Insert(x,y)$, then by the y axis Insert row in Table 3-1, $R_1$, and $R_2$, we
       conclude $(x,y) \in V_\varepsilon$ (only AbsentNotSeen is possible on the X axis).

   (2) if $op(\varepsilon) = Update(x,y)$, then using $R_1$: if $(x,y) \notin V_\varepsilon$ then there must exist $e \in E$ such
       that $Earlier(\varepsilon,e)$, $e \to \varepsilon$, and $op(e) = Update(x,y')$ where $y \ne y'$

       By lemma 3, $when(e) < when(\varepsilon)$

       By definition then $Earlier(e,\varepsilon)$ and thus $(x,y) \in V_\varepsilon$

   (3) $\therefore$ By (1) and (2), $(x,y) \in V_\varepsilon$. $\nabla$


Lemma 5

   Let $\varepsilon \in E$ be such that $\varepsilon \to e^*$ and $op(\varepsilon) \in \{Insert(x,y),\ Update(x,y)\}$. In addition,
   assume $\varepsilon$ is the event which made $(x,y) \in {}_{e^*}V$.

   Then if there exists $e^{**} \in E$ such that $\varepsilon \to e^{**} \to e^*$, $where(e^{**}) = where(e^*)$,
   $\varepsilon \notin SS_{e^{**}}$, and $e^{**}$ is minimal, then $Earlier(\varepsilon,e)$ for all $e \in E$ such that
   $t_{e^{**}}(where(e)) < when(e)$ and $op(e) \in BasicOps$.

Proof

   [Diagram: representative events for $\varepsilon$, $e^*$, $e^{**}$, and $e$.]

   Assume the hypotheses of the lemma.

   (1) $\varepsilon \notin SS_{e^{**}}$ means that during event $e^{**}$ either

       case (a): Knownby for the event $\varepsilon$ in the SS = Allnodes

                 Clearly, by the table, for any node j there is a path of events from
                 $where(\varepsilon)$ to $where(e^{**})$ which includes node j.

       case (b): $RemoteT(where(\varepsilon)) \ge when(\varepsilon)$ and $\varepsilon \notin RemoteSS$ (i.e., y.op =
                 AbsentSeen)

                 Again by the table and case (a), for any node j there is a path of events
                 from $where(\varepsilon)$ to $where(e^{**})$ which includes node j.

   (2) Let $e$ be such that $t_{e^{**}}(where(e)) < when(e)$. By (1) then there must exist an event $a$
       such that $\varepsilon \to a \to e$, $a \ne e$, $a \to e^{**}$, and $where(a) = where(e)$

   (3) By lemma 3, $when(\varepsilon) < when(e)$ and thus $Earlier(\varepsilon,e)$. $\nabla$


Now the correctness of the algorithm is shown.

Theorem

   $(\forall e' \in E)\,[view(e') = V_{e'}]$

Proof

   [Diagram: representative events for $\varepsilon$, $e^*$, $e^{**}$, $e$ and $e'$.]

1  Assume $(x,y) \in view(e')$

   (1) By V1 and V2, there exists $\varepsilon \in E$ such that

       (a) $\varepsilon \to e'$, $op(\varepsilon) \in \{Insert(x,y),\ Update(x,y)\}$ and

       (b) $(\forall e)\,[Removed(e,\varepsilon) \Rightarrow \neg(e \to e')]$

   (2) Thus by (1b) there cannot exist an $e \in E$ such that $Removed(e,\varepsilon)$ and $e \to e'$

   (3) By lemma 4, $(x,y) \in V_\varepsilon$

   (4) Let us assume $(x,y) \notin V_{e'}$

   (5) Let $e^* \in E$ be such that $\varepsilon \to e^* \to e'$, $(x,y) \in {}_{e^*}V$, $(x,y) \notin V_{e^*}$, with $\varepsilon$ the event
       which made $(x,y) \in {}_{e^*}V$, and $e^*$ minimal

   (6) If $(x,y') \notin V_{e^*}$ for any $y' \in W$, then by the table there must exist $e \in E$ such that
       $op(e) = Delete(x)$ and $e \to e^*$. Thus $Removed(e,\varepsilon)$, a contradiction

   (7) If $(x,y') \in V_{e^*}$ for some $y' \in W$ ($y' \ne y$), then by lemma 2 and $R_1$ there must exist
       $e \in E$ such that $op(e) = Update(x,y')$ and $e \to e^*$

   (8) Either $op(\varepsilon) = Insert(x,y)$ or $op(\varepsilon) = Update(x,y)$

       if $op(e) = Update(x,y')$ and $op(\varepsilon) = Insert(x,y)$ then $Removed(e,\varepsilon)$, a
       contradiction

       if $op(e) = Update(x,y')$ and $op(\varepsilon) = Update(x,y)$ then by the table ([1] axis y:
       Update; [2] axis X: Update, AbsentSeen, AbsentNotSeen)

          Update/Update: $(x,y') \in V_{e^*}$ only if $Earlier(\varepsilon,e)$; thus $Removed(e,\varepsilon)$, a
             contradiction

          Update/AbsentSeen: by definition, $\varepsilon \notin {}_{e^*}SS$, and $when(e) \le {}_{e^*}t(where(e))$.
             This is impossible by the minimality of $e^*$

          Update/AbsentNotSeen: by definition, $\varepsilon \notin {}_{e^*}SS$, and $when(e) > {}_{e^*}t(where(e))$.
             Therefore there must exist $e^{**} \in E$ such that it is minimal and
             $\varepsilon \to e^{**} \to e^* \to e'$, with $where(e^{**}) = where(e^*)$, and $\varepsilon \notin SS_{e^{**}}$.
             By lemma 1, ${}_{e^*}t(where(e)) \ge t_{e^{**}}(where(e))$. Thus
             $when(e) > t_{e^{**}}(where(e))$. Finally by lemma 5
             $Earlier(\varepsilon,e)$ and $\therefore Removed(e,\varepsilon)$, a contradiction

   (9) Thus $(x,y) \in V_{e'}$.


2  Assume $(x,y) \in V_{e'}$

   (1) By lemma 2, there exists $\varepsilon \in E$ such that $\varepsilon \to e'$, $op(\varepsilon) \in \{Insert(x,y),\ Update(x,y)\}$ and
       $\varepsilon$ is responsible for placing $(x,y)$ into $V_{e'}$

   (2) Assume there exists an $e \in E$ such that $Removed(e,\varepsilon)$ and $e \to e'$

   (3) Let $e^* \in E$ be minimal such that $\varepsilon \to e^* \to e'$ and $e \to e^*$

   (4) By inspection of the resolution table we know that once an event's item is removed
       from V that same event's item cannot be returned to the database. Thus by (1), (2),
       and (3), $(x,y) \in V_{e^*}$.

   (5) Clearly, $op(e) \in \{Update(x,y'),\ Delete(x)\}$

       if $op(e) = Delete(x)$ then $(x,y) \notin V_{e^*}$, a contradiction.

       if $op(e) = Update(x,y')$ and $op(\varepsilon) = Insert(x,y)$ then $(x,y') \in V_{e^*}$, so $(x,y) \notin V_{e^*}$, a
       contradiction.

       if $op(e) = Update(x,y')$ and $op(\varepsilon) = Update(x,y)$ and $Earlier(\varepsilon,e)$ then by the table
       clearly $(x,y') \in V_{e^*}$, so $(x,y) \notin V_{e^*}$, a contradiction.

   (6) By (5) then there cannot exist an $e \in E$ such that $Removed(e,\varepsilon)$ and $e \to e'$. Thus,
       $(\forall e)\,[Removed(e,\varepsilon) \Rightarrow \neg(e \to e')]$

   (7) $\therefore$ By (1) and (6), V1 and V2 hold for $e'$. Thus $(x,y) \in view(e')$. $\nabla$


Now the fact that the views are mutually consistent is shown.

Corollary

   If sufficient correct communication between nodes occurs, and changes to the data cease (no
   events from BasicOps), then all database views will converge to contain the same data.

Proof

   Consider the theorem and a sequence of events which are taken only from OtherOps. If
   Transmit and Receive operations are occasionally performed on every node and a
   communication path exists between every pair of nodes, then the corollary follows. $\nabla$
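
The convergence claim can be illustrated with a deliberately simplified simulation. This sketch is not the paper's algorithm: it exchanges whole views instead of synchronization sets, has no knownby lists and no Deletes, and resolves conflicts by keeping the later (when, where) stamp. The names `merge` and `gossip` are hypothetical.

```python
def merge(local, remote):
    """Receive a remote view: keep, per item, the entry with the later stamp."""
    for name, (value, stamp) in remote.items():
        if name not in local or local[name][1] < stamp:
            local[name] = (value, stamp)

def gossip(replicas, rounds):
    """Each round, node i transmits its view to node i+1 around a ring."""
    count = len(replicas)
    for _ in range(rounds):
        for i in range(count):
            merge(replicas[(i + 1) % count], replicas[i])
    return replicas

# Three nodes whose views diverged; stamps are (when, where) pairs.
replicas = gossip([
    {"x": ("a", (3, 0))},
    {"x": ("b", (3, 1)), "y": ("c", (1, 1))},
    {"y": ("d", (2, 2))},
], rounds=2)
```

No BasicOps occur during the exchange, and after two trips around the ring all three views agree: the Update of x from node 1 wins its tie with node 0 on node number, and the later change to y wins on time.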
8. Summary

This  paper  has  presented  a  suite  of  decentralized  algorithms  for  maintaining  distributed  replicated 
data  of  the  type  which  is  usually  found  in  directories  or  dictionaries.  The  algorithms  are  robust  and 
are  intuitively  easy  to  understand.  Although  they  do  not  attempt  to  guarantee  serial  consistency, 
they  are  adequate  for  many  simple  data  storage  problems.  The  algorithms  require  little  support  from 
the  communication  system  (basically  only  that  if  a  message  is  delivered,  it  is  ungarbled). 
Applications  which  may  benefit  from  the  type  of  algorithms  presented  include  mail  systems,  naming 
servers,  appointment  calendars,  certain  types  of  file  dictionaries,  operating  system  load  data 
maintenance  and  distributed  process  control  systems.  The  main  approach  taken  to  accomplish  the 
goals  of  the  algorithms  (availability,  performance,  and  simplicity)  involves  custom-tailoring  the 
algorithms  to  the  special  requirements  of  client  applications.  This  tailoring  is  simplified  by  using 
resolution  tables  which  specify  the  resolution  strategy  for  action  conflicts.  The  correctness  condition 
for  one  of  the  algorithms  was  defined  and  the  algorithm  was  proved  to  be  correct. 
Abstract


One of the problems fundamental to operating systems is maintaining the atomicity of a sequence of
operations despite concurrent activity or system/client failures. Atomic actions have been used for
this purpose in database systems and recently in programming languages. This paper introduces
support for atomicity in the kernel of an operating system. This support is not limited to managing
just one type of data (e.g., files) and could be used to ensure that any action (or task) is accomplished
atomically on a set of user-definable objects. The atomicity framework presented uses processes,
actions, and objects. Requirements for atomicity are discussed and system primitives are defined
which include the ability to create and terminate nested actions, control concurrency between
actions, and recover from action aborts. The facilities presented provide system designers and
programmers with the ability to control consistency requirements using whatever semantic
knowledge is available. The atomicity thus attained is called semantic atomicity. Unlike other work,
we do not tightly bind processes to actions, thus allowing the facilities presented to be applicable to a
wide class of systems (including applications where actions are supported by cooperating processes).
One particular approach for integration of the facilities is discussed in relation to the Clouds
decentralized global operating system. The desirability of semantic atomicity is illustrated through
a file directory system example. Use of the facilities to address the problem of actions supported by
cooperating processes is also illustrated through an example.
1. Introduction

Much of the recent work concerning reliability and data integrity in systems has focused on atomic
actions (atomic transactions) [Gray78, Davi78, Eswa76]. We will refer to atomic actions simply as
actions throughout this paper. Actions represent tasks which must be accomplished indivisibly. As
such they form the basic units of both recovery and concurrency control and can be characterized by
two properties:

•  failure atomicity: either all results of an action are applied to the objects referenced by the
   action or none are applied.

•  concurrency atomicity: the effect of executing actions concurrently must be the same as if each
   action executed indivisibly (i.e., atomically). Thus, an action's steps can be interleaved with
   other actions' steps so long as the result appears the same as if the actions were run serially.
   That is, the execution sequence is correct if it is serializable [Eswa76].

Actions can terminate either abnormally (by aborting) or normally (by committing). Actions which
are used within other actions for failure containment reasons are called nested actions [Davi73,
Reed78, Moss81, Lync83]. Nested actions appear atomic to the surrounding action or scope. That is,
both of the atomicity properties above apply, but become relative to the current nesting scope. Thus
nested actions fail independently of each other and the surrounding action, but commitment depends
on the surrounding action committing. During execution an action activation tree is naturally formed.
Nodes in the tree are actions and edges represent nesting relationships. When a nested action is
created, it becomes a child of the surrounding or parent action. All the immediate children of a parent
action are siblings. Ancestors of some action x represent the set of actions which completely define the
scope of x: these include the action x and all actions on the path to the root action (including the root
action). Descendant actions are similarly defined.

An action which is not nested is called a permanent action because if the action completes normally,
changes by the action are permanently applied. Permanent actions are root nodes in the action
activation tree. Changes made by a nested action are considered temporary until the permanent root
action commits. If an action (or nested action) aborts, then all descendants of the action are aborted
(maintaining failure atomicity). Unless specifically qualified, the term action will denote both
permanent and nested actions.
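
The action activation tree and the abort rule can be sketched in a few lines. The `Action` class below is a hypothetical illustration of the terminology only (parent, children, ancestors, permanent action, abort of descendants); it does not model commitment, concurrency control, or recovery.

```python
class Action:
    """One node of an action activation tree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.state = "active"
        if parent is not None:
            parent.children.append(self)   # a nested action becomes a child

    def is_permanent(self):
        # A permanent action is one that is not nested, i.e. a root of the tree.
        return self.parent is None

    def ancestors(self):
        # The action itself plus every action on the path to the root.
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path

    def abort(self):
        # Aborting an action aborts all of its descendants as well.
        self.state = "aborted"
        for child in self.children:
            child.abort()

root = Action("root")
a = Action("a", parent=root)
b = Action("b", parent=a)
c = Action("c", parent=root)
a.abort()
```

Aborting `a` also aborts its child `b`, while its sibling `c` and the surrounding action `root` remain active, matching the statement that nested actions fail independently of each other and of the surrounding action.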

Action support is relatively commonplace in distributed data storage systems using several different
implementation approaches (e.g., [Svob81, Lamp81]). However, most other application areas tend to
use a variety of specialized ad hoc techniques to attain the atomicity properties of actions when they
are required. One apparent reason why ad hoc approaches are used is that object types are usually
defined a priori by the action facility. Different data granule sizes may be used, but facilities do not
exist which allow arbitrary objects to be defined and operated upon by actions. Many simply use disk
pages or files. For example, suppose that it is desired to operate upon a specialized queue, a set, a file,
a tree-based file directory and a storage allocation module atomically. In any system which rigidly
structures objects, this becomes either impossible or exceedingly expensive because these general
objects must be mapped onto the supported objects (e.g., disk pages). Thus an extensible
(programming) environment for managing actions is desired. Other research addressing extensible
environments includes [Lome77, Lisk82, Reed82].

Extensible schemes which have been proposed have used atomic actions to structure processes (e.g.,
[Lome77, Lisk82]). This approach is a very convenient structuring methodology, but it cannot
address certain system problems. In particular, communicating processes [Dijk68] are incongruent
with this structure, even though the processes may be cooperating to perform some action. (Consider
a producer/consumer example with an unbounded message stream where both producer and consumer
are actions.) Processes performing actions in this structuring approach can communicate only after
one of the actions has committed. This, however, is clearly impossible if the processes must
communicate to complete the actions. Unlike these prior extensible schemes, we address action
structure and processes independently and do not bind actions tightly to particular processes. The
above-mentioned problems are then avoided.

Although atomic actions address certain problem areas well, there are environments where the
atomicity properties stated above are either too strong or inappropriate. It is well known that
serializability is too restrictive for certain applications [Lamp76, Garc82]. In some sense a more
general form of atomicity is involved in these applications. This is usually directly related to having
more semantic information available [Kung79, Papa79]. However, trading serializability for
performance has been noted as well [Fisc82, Jens82, McKe83]. Different levels of consistency have
been discussed by Gray [Gray75], but these levels are oriented toward a simple data framework. As
such, the consistency degrees suggested are not sufficient to capture the lower levels of consistency
available in a general setting. In this light, a general atomicity support system should permit
different degrees of atomicity to capture the necessary correctness conditions, without being overly
restrictive.

We are investigating atomicity mechanisms which can be embedded in operating systems and
hardware to allow applications, as well as certain portions of the operating system, to benefit from the
common facilities. We believe that integration of extensible facilities for achieving atomicity into an
operating system is quite novel. Even though atomicity support for data storage (in particular files)
has been suggested, our work involves a much more radical integration such that arbitrary aspects of
an operating system or application can be structured using actions. We specifically desire not to limit
what task an action may perform. For example, we do not want the only objects which can be
supported to be storage pages. While this is acceptable for integration of action facilities for storage
systems like files or databases, we instead desire a general programming environment for actions
where the properties of actions can be defined over any part of the system. Thus, the objects that an
action may manipulate may be programmable. Each object referenced could further use nested
actions when manipulating other objects. The atomicity which we desire in this environment we will
call semantic atomicity, as opposed to the absolute atomicity of the conventional approach. That is,
the meaning of atomicity depends on precisely what the action is attempting to do [Allc82]. The
concept of semantic atomicity encompasses the notion of absolute atomicity as stated above.

One uniform structuring approach for systems uses data abstraction and the object model [Jone79].
Within this paper, we will structure the world accordingly. We consider this choice to be neither
universally good nor bad, and the basic concepts presented for providing atomicity facilities are not
limited to this particular view. Message-based systems may approach certain aspects of the atomicity
problem differently (e.g., assigning processes to actions), but the fundamental aspects of the atomicity
facilities presented appear adequate. Thus our contribution spans both message-based and
procedure-based systems.

This paper describes the general system architecture we propose for managing atomicity, the
synchronization and recovery facilities we provide, and how these mechanisms might be incorporated
into an actual system. As an example environment, we use the Clouds [McKe83] decentralized global
operating system currently under construction for a local area network of Three Rivers Perq
computers. We believe that atomicity is particularly important for distributed systems because of the
independent failure modes of the nodes. Semantic atomicity is also important because of the desire in
distributed operating systems to sacrifice consistency for performance [Jens82, McKe83].

Section 2 details some of the requirements which must be addressed by any atomicity facilities
incorporated into an operating system kernel. Our system model and the general atomicity
primitives we propose are presented in Section 3. Section 4 discusses how these primitives might be
incorporated into an object-based system. Sections 5 and 6 contain examples (Section 5 illustrates
synchronization and recovery in a directory object, including operations which implement semantic
atomicity, and Section 6 illustrates an action performed by cooperating processes communicating through
messages). Substantial additional information is available in [Allc83].

2.  Atomicity  Requirements 

Compared to database systems, operating systems contain entities with more complex semantics.
While automatic support for atomicity is highly desirable, it may be more efficient in many cases to
provide the systems' constructors with the tools necessary to build atomic actions. This seems
reasonable for operating systems and system applications because the writers are usually quite
knowledgeable about the semantics of the system and can probably provide (cheaper) atomicity using
these tools. From these tools, automatic action support could be constructed for specific application
areas (e.g., database systems, object repositories, action-based languages, etc.). Thus the approach of
providing synchronization and recovery tools appears promising.

The tools approach has a tacit assumption concerning the reasons for recovery. Errors, that is, unexpected
conditions (such as software modules failing to meet their specification), cannot be handled with this
approach. We, however, are much more concerned with failures (expected, although undesirable,
conditions; for example, node and network failures, access rights violations, or process faults such as
division by zero). Thus, unlike recovery blocks and conversations [Rand78, Russ80, Shri78], failures
must be anticipated.

As discussed in Section 1, it is important to address semantic atomicity. Consider a file directory.
Most clients of the directory do not care, when a listing is made, if they see transient (uncommitted)
changes. Forcing operations of this type to be atomic will result in low levels of concurrency on the
directory. Of course, a file-backup client of the directory may insist on seeing a serial view. Thus,
what is acceptable depends on the semantics of use. In many cases it is possible in operating systems
to know these requirements a priori and thus (if the facilities were available) take advantage of these
semantics.

It is also important not to exclude cooperating processes from the atomicity support. In fact, it
appears desirable in operating systems not to assign actions to processes automatically. Instead a
more dynamic scheme is required which will allow one process to support many actions or several
cooperating processes to support one action (e.g., the client/server model with cooperating servers fits
this paradigm).

In general there are five areas of support necessary for atomicity. First, there must be some method
for the users to create and terminate actions. Second, there should be synchronization facilities (in
addition to process synchronization) which can be used by processes to maintain the atomicity
requirement between actions (concurrency atomicity). Locks and timestamps [Kohl81] are typical
synchronization tools used in database systems for this purpose. Third, there must be recovery
facilities which permit flexible management of data necessary to recover the action to a consistent
state (failure atomicity). Logs [Gray78] containing either before or after images (or both) have
typically been used for this purpose in database systems. In nested action environments, automatic
propagation of synchronization and recovery information to the parent action is also desirable.
Fourth, because the support for atomicity is not performed completely automatically, there must be
facilities which permit user-defined processing on the transition of an action to another state (e.g.,
performing recovery on the operation -> abort transition). Finally, there must be process agreement
facilities which allow the processes performing an action to reach a consensus, despite failures,
concerning action state transitions (particularly the operation -> commit transition).

3.  System  Primitives  for  Supporting  Atomicity


3.1  System  Model 

Physically we view the environment as composed of nodes and an interconnection network. The nodes
communicate through messages sent through the network. The nodes contain two different types of
memory: volatile and permanent. Nodes may crash (fail), erasing the contents of volatile memory but
without disturbing permanent memory. When a crash occurs we assume that all processing stops and
that random messages and random changes to permanent memory do not occur. The network is also
unreliable and can lose, duplicate, or re-order transmitted messages. Messages, if delivered, however,
must arrive ungarbled. That is, message corruption must be detectable.
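
The network assumptions above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each message carries a sequence number and a CRC-32 checksum; the names `frame`, `unframe`, and `Receiver` are hypothetical and do not come from the paper. The receiver rejects corrupted frames and suppresses duplicates; it does not attempt to repair loss or re-ordering.

```python
import struct
import zlib


def frame(seq: int, payload: bytes) -> bytes:
    """Prefix a sequence number and append a CRC-32 of the whole body."""
    body = struct.pack("!I", seq) + payload
    return body + struct.pack("!I", zlib.crc32(body))


def unframe(data: bytes):
    """Return (seq, payload), or None if the frame is detectably corrupt."""
    if len(data) < 8:
        return None
    body = data[:-4]
    (crc,) = struct.unpack("!I", data[-4:])
    if zlib.crc32(body) != crc:
        return None
    (seq,) = struct.unpack("!I", body[:4])
    return seq, body[4:]


class Receiver:
    """Drops corrupt frames and frames whose sequence number was already seen."""

    def __init__(self):
        self.seen = set()

    def accept(self, data: bytes):
        msg = unframe(data)
        if msg is None or msg[0] in self.seen:
            return None
        self.seen.add(msg[0])
        return msg[1]


rx = Receiver()
good = frame(7, b"commit?")
damaged = bytes([good[0] ^ 0x01]) + good[1:]  # flip one bit in transit
first = rx.accept(good)
duplicate = rx.accept(good)
corrupt = rx.accept(damaged)
```

Here the first delivery is accepted, the duplicate is ignored, and the damaged copy is rejected because its checksum no longer matches.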

There are three logical entities in the system: processes, objects, and actions. Processes are active
agents which execute at a single node. Processes may be directly created and terminated only
through the kernel at that node. Node failures can indirectly terminate a process. Actions are units
of concurrency and recovery. Actions may span node boundaries and may be concurrently performed
at several nodes. The node where an action is created is considered the coordinator node for the
action. Actions, via processes, manipulate objects. An object may be considered to be an instance of a
generalized abstract data type (even though not necessarily implemented this way) which can only be
operated upon through well-defined operations. During an action, objects referenced must not be
moved from the node where the action first referenced the objects. If an object is moved between
nodes, no action may operate upon the object during the migration.

Processes, actions, and objects are identified through processids, actionids, and objectids.
Maintenance of process identification is assumed external to the action support environment;
however, it is assumed that the identification is unique within the node where the process is created.
Action identification must be unique across all nodes. Action identifiers are provided by the action
support primitives discussed in Section 3.2. Object identification need only be unique within the node
where the object is located for the action support facilities. Even though system-wide uniqueness is
not required by the action facilities specifically, it may be necessary for other aspects of particular
systems (e.g., if objects can be globally addressed). One kernel primitive is provided to generate
unique objectids; this could be changed appropriately to achieve the uniqueness required.

Normally  processes  do  not  recover  from  node  failures.  However,  we  require  a  special  kernel  primitive 
which  allows  processes  to  be  automatically  restarted  at  some  location  following  a  node  crash.  That  is, 
on restarting the system after a failure, a checkpointed process will resume at some user-definable
location with certain variables reinitialized to checkpointed values. This is necessary to guarantee
correct  processing  of  the  action  state  transitions. 


3.2  Action  Creation,  Use,  and  Termination 

Action  creation  is  performed  through  the  following  kernel  function: 

function  create_action  (actiontype  :  (permanent,  nested);  parent  :  actionid)  :  actionid

The actionid is an index into a kernel-protected action identification table. This table, one local to
each node, contains information concerning the state of the known actions, which processes are
performing the action, and which objects have been affected by the action. Processes are free to store
the actionid as desired (or even pass it); in a production implementation, capabilities which associate
processes to actions would probably be necessary. The actiontype parameter specifies whether the
action should be created as a permanent action (no nesting) or relative to the specified parent action.
It is possible that the parent no longer exists, in which case the caller receives an error on
invocation.

Processes may execute on behalf of only one action at a time; binding actions to processes is performed
dynamically. Assuming the action specified still exists, a process can become linked to the action
through the following kernel call:

procedure  link  (newaction  :  actionid) 

This dynamic assignment permits processes to manage several actions, if desired. This ability is
particularly attractive for server processes managing several user actions, for example. Linking to
an action x automatically unlinks any action y currently linked to the process. A null actionid is
available to unlink the process from all actions.
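
The creation and linking rules described above can be sketched as a toy, single-node model. This is only an illustration under stated assumptions: the class `Kernel`, the exception `ActionError`, the `current` helper, and the explicit `pid` argument are hypothetical (a real kernel would know the calling process), and `None` stands in for the null actionid.

```python
import itertools


class ActionError(Exception):
    """Raised when a primitive names an action that does not exist."""


class Kernel:
    """Toy stand-in for one node's action identification table."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.actions = {}  # actionid -> {"type", "parent", "state"}
        self.linked = {}   # processid -> actionid, or None when unlinked

    def create_action(self, actiontype, parent=None):
        if actiontype not in ("permanent", "nested"):
            raise ValueError(actiontype)
        if actiontype == "nested" and parent not in self.actions:
            raise ActionError("parent action does not exist")
        aid = next(self._ids)
        self.actions[aid] = {
            "type": actiontype,
            "parent": parent if actiontype == "nested" else None,
            "state": "active",
        }
        return aid

    def link(self, pid, newaction):
        # Linking replaces any previous link; None unlinks the process.
        if newaction is not None and newaction not in self.actions:
            raise ActionError("action does not exist")
        self.linked[pid] = newaction

    def current(self, pid):
        return self.linked.get(pid)


kernel = Kernel()
top = kernel.create_action("permanent")
child = kernel.create_action("nested", parent=top)
kernel.link("server", top)
kernel.link("server", child)      # automatically unlinks `top`
linked_now = kernel.current("server")
kernel.link("server", None)       # null actionid: unlink from all actions
```

A server process multiplexing several client actions would call `link` each time it switches from one client's work to another's.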

It may be necessary to determine what actionid is currently linked to a process. This is useful for the
synchronization discussed below.

function  hllul  actionid 

Termination (commitment or abortion) of an action is performed as shown below. Both procedures
can return errors if the action does not exist (e.g., already aborted). If a process terminates, then all
actions associated with the process (determined from the action identification table) are aborted.
(Recall we are not addressing software errors.) Both of the termination procedures operate on the
action currently linked to the invoking process.

If a nested action is being committed, all synchronization state and recovery logs are inherited by the
action's parent (because the child has completed). If a nested action has visited remote nodes, a one-
phase distributed commitment protocol is begun. If a permanent action has visited remote nodes, a
two-phase commit protocol is used [Gray78]. Once all recovery information is safely stored in
permanent memory, special user-definable procedures are performed to complete the commit
processing (see below). The timelimit associated with the commit procedure is useful when multiple
processes are cooperating on an action. If the associated processes do not request commitment within
the specified duration, then instead of committing, the action is aborted. All cooperating processes
must agree that the action is complete by executing the commit primitive before final commitment
occurs. Thus, we avoid the domino effect [Rand78].

procedure  commit  (timelimit  :  timedurationtype)

procedure  abort 
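
The agreement rule for cooperating processes (every process must execute commit before the time limit expires, otherwise the action aborts) can be sketched roughly as follows. This is a simplified illustration, not the protocol of the paper: the class `PendingCommit`, its `expire` method, the explicit `now` argument (a logical clock), and the choice to start the clock at the first commit request are all assumptions. The state names are borrowed from the notify primitive.

```python
class PendingCommit:
    """Tracks which cooperating processes have asked to commit one action."""

    def __init__(self, participants):
        self.waiting = set(participants)  # processes that have not yet agreed
        self.deadline = None
        self.state = "active"

    def commit(self, pid, now, timelimit):
        if self.state != "active":
            return self.state
        if self.deadline is None:
            # Assumption: the first commit request starts the clock.
            self.deadline = now + timelimit
        if now > self.deadline:
            self.state = "aborted"
            return self.state
        self.waiting.discard(pid)
        if not self.waiting:
            self.state = "complete"  # every cooperating process has agreed
        return self.state

    def expire(self, now):
        """Called by a timer: abort if the deadline passed without agreement."""
        if self.state == "active" and self.deadline is not None and now > self.deadline:
            self.state = "aborted"
        return self.state


agreed = PendingCommit({"producer", "consumer"})
step1 = agreed.commit("producer", now=0, timelimit=5)
step2 = agreed.commit("consumer", now=3, timelimit=5)

late = PendingCommit({"producer", "consumer"})
late.commit("producer", now=0, timelimit=5)
step3 = late.commit("consumer", now=9, timelimit=5)
```

In the first case both processes agree in time and the action completes; in the second the consumer asks too late and the action is aborted instead.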

As a process, on behalf of an action, accesses an object, a series of events occurs. These events are
diagrammed in Figure 1. Special client procedures can be defined for all three of the special events:
BOA (beginning of action), EOA (end of action), and Abort.

Processes must inform the kernel when a new object on the current node has been referenced. In
addition, the processing code for the events of BOA, EOA, and Abort must be defined for this object. If
special event processing is not required for one or all of these, a special procedure name of none can be
used. The same event procedure code can be shared by several objects if the processing required is the
same. Event procedures cannot use actions during their processing. Processes inform the kernel of
the referenced objects through the following primitive:

[Figure 1 diagram: an action passes from BOA through its operations to either EOA (commit) or Abort.]

Figure  1  Action  Events  Related  to  Objects


procedure  touchobject  (object  :  objectid;  boa,  eoa,  abort  :  procedure);


The first time touchobject is executed for an action/object pair, the operating system updates the list of
objects referenced by that action. This is used to execute the associated event procedures on
termination of the action. The event procedures which may be defined are:

BOA  beginning of action

This procedure is performed immediately following the invocation of the
touchobject primitive, assuming that this is the first time this object has been
touched by the current action.

EOA  end of action (commit)

The EOA code is executed after the recovery area is safely stored in permanent
storage following permanent action commitment; it is not executed on nested
action commitment. This event processing procedure must be written in an
idempotent manner. That is, it may be (re)executed many times due to system
failures, and any complete execution must be correct regardless of prior partial
executions.

Abort  abort action

Once an action has been aborted by a process, the Abort code associated with all
objects touched by the action is executed. The Abort event does not occur if volatile
memory fails. This is explained further in Section 3.4.
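
The event scheme above can be sketched as a small registry. This is a hedged illustration only: the class `EventTable`, the explicit `action` argument, and the example directory object are hypothetical, Python's `None` plays the role of the special procedure name none, and only commitment of a permanent action is modelled (nested commitment, which does not run EOA, is left out).

```python
class EventTable:
    """Toy per-node record of the objects each action has touched."""

    def __init__(self):
        self.touched = {}  # actionid -> {objectid: (eoa, abort)}

    def touchobject(self, action, obj, boa=None, eoa=None, abort=None):
        objs = self.touched.setdefault(action, {})
        if obj in objs:
            return  # only the first touch registers the object and runs BOA
        objs[obj] = (eoa, abort)
        if boa is not None:
            boa(obj)

    def commit(self, action):
        # Models commitment of a permanent action: run EOA for each object.
        for obj, (eoa, _) in self.touched.pop(action, {}).items():
            if eoa is not None:
                eoa(obj)

    def abort(self, action):
        for obj, (_, on_abort) in self.touched.pop(action, {}).items():
            if on_abort is not None:
                on_abort(obj)


# Example object: a directory whose pending entry is installed at EOA.
directory = {"entries": set(), "pending": "report.txt"}
trace = []


def boa(obj):
    trace.append("BOA")


def eoa(obj):
    # Idempotent: adding to a set gives the same result however often it runs.
    directory["entries"].add(directory["pending"])
    trace.append("EOA")


def on_abort(obj):
    trace.append("Abort")


events = EventTable()
events.touchobject(action=1, obj="dir", boa=boa, eoa=eoa, abort=on_abort)
events.touchobject(action=1, obj="dir", boa=boa, eoa=eoa, abort=on_abort)
events.commit(1)
eoa("dir")  # simulated re-execution of EOA after a crash
```

Because the EOA procedure only adds an element to a set, re-running it after a failure leaves the directory in the same state, which is the idempotence property the text requires.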


Even though these are specified as procedures here, this same event scheme could be used in a
message-based system. This could be accomplished by defining event messages (possibly by
exceptions or emergency messages) to represent these action state changes and requiring each
process to appropriately handle the events.


If a process transmits an actionid to a remote node while processing an action, the local kernel must
be informed which node was accessed. This information is used for coordinating action state
transitions. The following kernel call is used to inform the local kernel of the access. If the process
does not actually access the remote node for some reason after executing this call, it is unimportant,
since the atomicity system will discover this from the remote node during action termination.

procedure  offnode  (remotenode  :  nodeid);


The last kernel primitive permits processes to request the state of an action. This is particularly
useful when multiple processes, working cooperatively on some action, must reach agreement before
deciding whether to abort or commit. This can be used in the implementation of the conversation
concept [Rand78]. Section 6 illustrates this primitive in a cooperating process environment.

procedure  notify  (action  :  actionid;  state  :  (active,  aborted,  complete,  unknown)) 

These primitives are sufficient to manage both permanent and nested actions in an elegant manner.
For example, even in a nested user action it is possible to perform a controlled violation of the current
action nesting for maintaining operating system state data (e.g., process queues). In addition,
intrinsic to the system is the concept of cooperating processes performing an action. This is a natural
extension to the use of cohorts [Gray78] in distributed database systems, where a transaction has a
support process (a cohort) on each node it visits.


3.3  Action  Synchronization  Facilities 


Processes may need to perform specialized synchronization with respect to one another if they are
linked to actions. It is possible to control action synchronization via almost any general process
synchronization scheme, because processes have access to actionids and also have the ability to
determine those conditions that constitute a conflict. However, for convenience and efficiency, we
propose that common action synchronization mechanisms be available in the kernel. This does not
prevent coding specific synchronization as necessary to obtain additional concurrency (e.g.,
[Lamp76]). We provide two basic action activation tree synchronization mechanisms: multi-mode
locking and counting semaphores.


3.3.1  Action-based  Multi-mode  Locking


Locks [Eswa76] are a reasonable choice for one mechanism, because there are many concurrent data
structure maintenance algorithms in operating systems which use a locking model (e.g., [Kwon82]).
Our approach requires a lock compatibility table to be defined before lock operations can be used. The
goal is to provide a framework more general than simple read/write locking modes. The directory
example presented in Section 5 illustrates why this approach is desirable.


The lock domain, mode compatibilities, and the lock protocol used are determined by the process
defining them. By associating a domain with each lock type, it is possible to solve the phantom
problem [Eswa76]. That is, entities do not have to exist at the time they are locked. Again, the
directory example illustrates the significance of this. By allowing programmers to control lock
protocols, coordination schemes such as non-two-phase protocols can be used [Moha82], driven by the
semantics of the accessing pattern. Below are the locking operations:


modetype  =  integer;  {system dependent}
lockidtype  =  integer;  {system dependent}
instanceidtype  =  integer;  {system dependent}

compatibilities  =  record
    moderequesting  :  modetype;
    compatibleset  :  set of modetype;
end;

procedure  defineconflict  (lockid  :  lockidtype;  accesstable  :  set of compatibilities)
function  setlock  (lockid  :  lockidtype;  thing  :  instanceidtype;  m  :  modetype;  timeout  :  integer)
    :  (okfirsttime,  okothertimes,  timeout,  invalid)
function  testlock  (lockid  :  lockidtype;  thing  :  instanceidtype;  m  :  modetype;  aid  :  actionid)
    :  (ok,  conflict,  invalid)
function  releaselock  (lockid  :  lockidtype;  thing  :  instanceidtype;  m  :  modetype)
    :  (ok,  notset,  invalid)
function  releaseall  (lockid  :  lockidtype)  :  (ok,  invalid)


Suppose a process requests a lock in some mode m and is linked to action x. If only ancestor actions of
x from the action tree hold lock modes incompatible with m, the lock is set. For example, in the simple
shared read / exclusive write situation, only x's ancestors can hold write mode locks and still permit x
to obtain a write lock. As actions commit, the ownership of the locks propagates to the immediate
parent action. If a setlock is executed and the lock cannot be set because of mode incompatibilities,
the process is suspended until the lock can be set or until a timeout occurs. Once x operates on some
lock, x's ancestors may not touch that lock until x terminates.

The special status indicators of okfirsttime and okothertimes used in setlock are provided so that
applications can detect when they have already locked an object in the given mode. This information
is useful for determining when it is necessary to save the state of the object through the recovery
facilities.
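
The ancestor rule and the compatibility table can be sketched as follows. This is a minimal illustration under several assumptions: the class `LockManager`, the `parent_of` mapping, and the explicit `action` argument are hypothetical; every call behaves as if its timeout were zero, so a conflict returns "timeout" immediately instead of suspending the caller; and the rule that ancestors may not touch a lock while a descendant is using it is not modelled.

```python
class LockManager:
    """Toy multi-mode lock table that honours the action tree."""

    def __init__(self, parent_of):
        self.parent_of = parent_of  # actionid -> parent actionid (None at top)
        self.compat = {}            # lockid -> {mode: set of compatible modes}
        self.held = {}              # (lockid, thing) -> set of (actionid, mode)

    def defineconflict(self, lockid, table):
        self.compat[lockid] = {mode: set(ok) for mode, ok in table.items()}

    def _ancestors(self, action):
        found = set()
        parent = self.parent_of.get(action)
        while parent is not None:
            found.add(parent)
            parent = self.parent_of.get(parent)
        return found

    def setlock(self, lockid, thing, mode, action):
        if lockid not in self.compat or mode not in self.compat[lockid]:
            return "invalid"
        holders = self.held.setdefault((lockid, thing), set())
        if (action, mode) in holders:
            return "okothertimes"
        ancestors = self._ancestors(action)
        for holder, held_mode in holders:
            if holder == action or holder in ancestors:
                continue  # incompatible modes held by ancestors do not block
            if held_mode not in self.compat[lockid][mode]:
                return "timeout"  # zero-timeout stand-in for suspension
        holders.add((action, mode))
        return "okfirsttime"

    def commit(self, action):
        # Lock ownership propagates to the immediate parent action.
        parent = self.parent_of.get(action)
        for holders in self.held.values():
            mine = {(a, m) for a, m in holders if a == action}
            holders -= mine
            if parent is not None:
                holders |= {(parent, m) for _, m in mine}


locks = LockManager({1: None, 2: 1, 3: 1})  # actions 2 and 3 are children of 1
locks.defineconflict("rw", {"read": {"read"}, "write": set()})
r_parent = locks.setlock("rw", "x", "write", action=1)
r_child = locks.setlock("rw", "x", "write", action=2)
r_again = locks.setlock("rw", "x", "write", action=2)
r_sibling = locks.setlock("rw", "x", "write", action=3)
locks.commit(2)
r_after = locks.setlock("rw", "x", "write", action=3)
```

The child obtains the write lock although its parent holds one, the sibling is refused while the child holds it, and the sibling succeeds once the child commits and ownership passes to the parent.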


3.3.2  Action-based  Counting  Semaphores


Locking as presented above allows actions to avoid one another in order to achieve serializability. It
is also desirable in some cases to have the ability to apply additional ordering constraints. For
example, guaranteeing one sibling will execute before another appears to be a common problem (see
for instance the example in Section 4.1). Our work is novel in generalizing action tree
synchronization in this manner.


In a nested action environment, semaphore values are managed according to the visibility that an
action has, depending on the action's location in the action activation tree. Thus an action can receive a
V value from an ancestor, but not from a sibling. Upon commitment, the changes to the semaphore
are appropriately propagated. If the action aborts, the borrowed values from its ancestors are
returned. Thus this mechanism is an extension of standard counting semaphores to the realm of
reliable computing in a nested action environment. As with the locking mechanism presented above,
once a child executes a semaphore operation on some semaphore, no ancestor may reference that
semaphore until the child completes. If processes are cooperating in performing some action, then
because they will be using the same actionid, the action-based semaphores become equivalent to
standard semaphores. Further details and the associated algorithms are included in [Allc83]. The
operations are shown below:


semaidtype  =  integer;  {system dependent}

function  definesemaphore  (initialvalue  :  integer)  :  semaidtype
function  destroysemaphore  (semaid  :  semaidtype)  :  (ok,  invalid)
function  actionP  (semaid  :  semaidtype;  timeout  :  integer)  :  (ok,  timeout,  invalid)
function  actionV  (semaid  :  semaidtype)  :  (ok,  invalid)

3.3.3  Guaranteeing  Progress 

Specific support for handling deadlock and livelock is not provided by the kernel. Appropriate system structure
and associated lock and semaphore protocols can prevent deadlock in many cases. However, if
deadlocks can occur in the system, the responsibility for appropriate action lies with the implementor
(using timeouts, etc.). If locking were the only mechanism used for action synchronization, then
deadlock detection would be straightforward (although probably expensive). However, as discussed
above, processes may perform specialized synchronization between actions without using locks. This
makes the problem extremely difficult because it may not be possible to determine which actions
another action may be waiting for. We will not discuss this further here.


3.4  Action  Recovery  Facilities 


Logging appears to be a reasonable method for maintaining action recovery information. To support
logging, system primitives are available to write and read records associated with action/object pairs.
During the life of an action, records may be written to the log. If the action aborts, the log is deleted
after the Abort processing event is complete. If the action commits, the log is inherited by the action's
parent. This involves no data movement from the recovery log, simply a notation to be made
regarding which action owns the log. If the action is permanent, then the log is placed on permanent
storage. Each node maintains a local log for any action known at that node, and each saves its
portion of the complete log during commitment. After the logs are safely stored, the event EOA
occurs. The log is automatically discarded upon completion of the event EOA.

Any information desired may be placed in the log. However, because the log for an action is not
saved in permanent memory until the associated permanent action commits, recovery from
an action abort cannot require the log to recover across a volatile memory failure. In general, though,
we suspect that assuming actions will complete is the proper assumption for operating systems and
many applications. This optimistic viewpoint dictates that changes be made to the current version of
shared entities, using the log to maintain the unaltered version. This approach results in much of the
overhead associated with supporting actions being tied to abort and not commitment.

During the first write by an action to the log for some object, the log is officially created. To notify the
recovery facility that the log records should be returned, a reset operation is used. If the items to be
saved are memory pages, then it is possible to integrate some of the logging system with the memory
management system (e.g., by manipulating page tables). The log write, read, and reset primitives are
defined below:

procedure  writelog  (object  :  objectid;  < array of items to save, length and address >)
procedure  readlog  (object  :  objectid;
    < array of addresses of where to place returned items >;  status  :  (ok,  endoflog))
procedure  resetlog  (object  :  objectid)
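
The optimistic use of the log (change the object in place, keep the unaltered version in the log, and pay the cost only on abort) can be sketched as follows. This is an illustration only: the class `RecoveryLog`, its `discard` method, the explicit `action` argument, and the example cell object are hypothetical, the log lives in ordinary (volatile) memory, and it is assumed that reset positions reading at the first record.

```python
class RecoveryLog:
    """Toy volatile log keyed by (actionid, objectid) pairs."""

    def __init__(self):
        self._records = {}  # (action, obj) -> list of saved items
        self._cursor = {}   # (action, obj) -> index of next record to read

    def writelog(self, action, obj, item):
        # The first write creates the log for this action/object pair.
        self._records.setdefault((action, obj), []).append(item)

    def resetlog(self, action, obj):
        # Assumption: reset makes the next read return the first record.
        self._cursor[(action, obj)] = 0

    def readlog(self, action, obj):
        key = (action, obj)
        index = self._cursor.get(key, 0)
        records = self._records.get(key, [])
        if index >= len(records):
            return "endoflog", None
        self._cursor[key] = index + 1
        return "ok", records[index]

    def discard(self, action, obj):
        self._records.pop((action, obj), None)
        self._cursor.pop((action, obj), None)


log = RecoveryLog()
cell = {"value": 10}

# Optimistic update in place: save the before-image, then change the object.
log.writelog(action=1, obj="cell", item=cell["value"])
cell["value"] = 25


def abort_cell(action):
    """Abort event procedure: restore the first before-image, drop the log."""
    log.resetlog(action, "cell")
    status, before = log.readlog(action, "cell")
    if status == "ok":
        cell["value"] = before
    log.discard(action, "cell")


abort_cell(1)
```

After the abort procedure runs, the cell holds its original value again and the log for that action/object pair is gone; a committing action would never have read the log at all.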

One possible extension of the recovery facilities involves client-controlled checkpointing of the log
into a staging area of permanent memory during nested action commitment. However, this can
become expensive when multiple nodes are involved, forcing two-phase commitment to be used during
every action commitment. For long-running actions, though, this may be necessary. The general
approach in this case would be to subdivide the long-running action into a group of nested actions
which could be checkpointed upon completion. The parent action would simply guarantee that the
changes would only become visible if the entire task was accomplished.

3.5  Implementation  Structures 

Shown below is a rough sketch (Figure 2) of the structure necessary to support the facilities discussed
above.


4.  One  Possible  Application  Using  the  Primitives 

This section describes how the system primitives defined above might be incorporated into an actual
operating system. The approach used in Clouds [McKe83] is to define a programming language (a
Pascal derivative) which has specific support for actions and atomicity. The compiler converts the
language constructs into the necessary system primitives. The language approach is convenient
because it organizes the atomicity primitives into a uniform structure and removes some of the causes
for errors in their use.
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Figure  2  Conceptual  Data  Structures 


In Clouds, an object type is a globally named generalization of an abstract data type which can only
be operated upon through well-defined operations. An object is an instance of some object type.
Objects are distributable in Clouds and may reside anywhere in the network. For the purposes of this
paper we will consider only objects which support actions (viz., support recovery and action
synchronization). Objects in this class can be considered to be composed of three basic components:
the data portion (data and the operations on the data), synchronization necessary for shared access,
and recovery control. In addition, some object state may be kept in permanent memory in order to
survive volatile memory failures. Figure 3 illustrates a conceptual internal structure of an object.


Nested actions can be used to specify units of synchronization and recovery. Each object type
operation can be denoted as an action definition (similar to [Lome77]). Action atomicity is used to
transform the state of the objects referenced in the action into a new (consistent) state. Semantic
atomicity is desired for all actions and it is the responsibility of each object type to ensure that
appropriate abstract behavior is provided.

Object definers can control synchronization among actions by specification statements which are
provided when an object type is defined. An access statement is used to specify the object operation
compatibility necessary to arbitrate access between actions. These operation compatibilities are
managed using generalized locking modes (possibly one for each object procedure) to enforce the
specification. The locks are managed through a two-phase locking technique which ensures that
serializable abstract behavior can be achieved. The locks are held until action termination.
Incompatibility between a requesting action and an accessing action of some object causes the
requesting action's process to block until some specified timeout occurs or access is allowed.


To force a certain path or order of executions of the procedures by actions, the order statement can be
used. The format of this statement is similar to path expressions [Camp74] and can be directly
compiled into operations on action-based semaphores. Sequencing (;), repetition (n:), and alternation
(,) can be specified in the order statement.

Figure 3. Data Object Structure


Object definers are additionally provided with the tools necessary to synchronize action access to local
data within the object. This is accomplished by using a sync monitor similar to a standard
synchronization monitor [Hoar74]. It can be used to control mutually exclusive access to local
variables within an object. Any object procedures can be placed into the monitor. Synchronization
within the sync monitor is possible through statements which allow events to be waited upon and
signalled. Programming arbitrary action synchronization is possible through the synchronization
monitor and via lock statements which are directly compiled into lock system primitives. Thus a dual
approach for action synchronization is provided: static specification when an object type is defined
and dynamic programming tools to address special problems. This generality permits the tradeoffs of
simplicity and performance to be adequately addressed. The synchronization facilities are discussed
in more detail in Section 4.1.
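
The wait and signal facilities of a sync monitor behave much like those of a conventional monitor. The Python sketch below uses threading.Condition as a stand-in. The class SyncMonitor and its put and take procedures are hypothetical names chosen for the example, and nothing here reflects the actual Clouds syntax.

```python
import threading


class SyncMonitor:
    """Hypothetical monitor guarding one local variable of an object."""

    def __init__(self):
        self._cond = threading.Condition()  # a single lock gives mutual exclusion
        self._value = None
        self._ready = False

    def put(self, value):
        with self._cond:                    # enter the monitor
            self._value = value
            self._ready = True
            self._cond.notify_all()         # signal the "value ready" event

    def take(self, timeout=None):
        with self._cond:
            # Wait on the event; give up and return None if the timeout expires.
            if not self._cond.wait_for(lambda: self._ready, timeout):
                return None
            self._ready = False
            return self._value
```

A caller of take blocks inside the monitor until some other thread calls put, or until the optional timeout expires.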


Each object type contains special procedures to manage the action events of BOA, EOA, and Abort.
The compiler generates some additional setup code for these procedures, but in general they behave
in the same manner as discussed in Section 3.1. Each object type can also define variables which
must be made permanent upon commitment of a permanent action.


In the Clouds object framework, actions can be organized naturally by objects which reference
operations (which may be nested actions) on other (conceptually lower level) objects. For example,
consider the object types shown in Figure 4. When a createfile operation is executed the getspace and
createentry actions compose to form the createfile action.

Figure 4. Natural Nesting Example


Invocation of object operations by a process is performed by procedure call.

<capability for object instance> <operation> (<parameters>) [weak]
    [exception
        <cause1> : <statement list>
        <cause2> : <statement list>
        others : <statement list>
    end]

By default the call is reliable and is performed in a manner which ensures "once and only once"
semantics (even if the target object is located on a remote node) [Spec81]. The calling process waits
for completion of the call or until an exception is raised (e.g., timeout). If the target object does not
support actions, an error is returned if the executing process is acting on behalf of an action. If an
action has been linked to the executing process, the compiler generates code to notify the kernel
accordingly (through touchobject). This is used to return to the objects upon commit or abort of the
action (refer to the EOA and Abort events discussed above). The optional keyword weak specifies
that no value is to be returned and that if the target object is on another node in the network, then
only one send need be performed (no waiting); the transmission is assumed unreliable. The operation
is therefore executed either zero times or once. This option cannot be used if the executing
process is linked to an action. If an exception (e.g., timeout) is raised and actions were created during
the invocation, then these actions are aborted.
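
The difference between the default call and a weak call can be sketched in Python as follows. The mechanism shown (retransmission by the caller combined with duplicate suppression at the target) is one common way to obtain "once and only once" behavior and is an assumption here, not a description of the protocol in [Spec81]. Target, LossyLink, and invoke are hypothetical names, and the doubling operation merely stands in for an object operation.

```python
class Target:
    """Hypothetical remote object that suppresses duplicate requests."""

    def __init__(self):
        self.executions = 0
        self._replies = {}                  # request id -> cached reply

    def handle(self, request_id, arg):
        if request_id not in self._replies:
            self.executions += 1            # the operation body runs only once
            self._replies[request_id] = arg * 2
        return self._replies[request_id]


class LossyLink:
    """Scripted network: each send consumes one outcome from the pattern."""

    def __init__(self, target, pattern):
        self.target = target
        self._pattern = iter(pattern)       # "ok", "lost_request" or "lost_reply"

    def send(self, request_id, arg):
        outcome = next(self._pattern, "ok")
        if outcome == "lost_request":
            return None                     # the target never sees the message
        reply = self.target.handle(request_id, arg)
        return None if outcome == "lost_reply" else reply


def invoke(link, request_id, arg, weak=False, max_tries=5):
    if weak:
        link.send(request_id, arg)          # one send, no waiting, no value returned
        return None
    for _ in range(max_tries):              # reliable call: retransmit until a reply
        reply = link.send(request_id, arg)
        if reply is not None:
            return reply
    raise TimeoutError("invocation timed out")
```

With the pattern ["lost_request", "lost_reply", "ok"], a reliable call sends three times but the operation body runs once. A weak call over a link that loses the request runs it zero times.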

4.1  Action  Synchronization 

The  access  specification,  as  discussed  above,  is  used  to  state  the  object  type  operation  compatibilities 
in  order  to  arbitrate  access  between  actions.  Incompatibility,  a  conflict,  causes  requesting  actions  to 


wait  for  the  conflict  to  be  removed  (or  until  some  specified  timeout  occurs).  The  general  form  of  the 
specification  follows. 

<compatibility> ::= <mode requesting_i> : [<mode held_1>, ..., <mode held_n>]

access = (<compatibility_1>; ...; <compatibility_m>)

For example, access = (read : [read]; write : []) represents the usual one writer / multiple reader
synchronization, assuming that there are two operations on the object type (read and write).
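
One way to read such a specification is as a table mapping each requesting mode to the held modes it can coexist with. The Python sketch below encodes the one writer / multiple reader example in that form. The names ACCESS and can_grant are hypothetical, and a real implementation would block the requester rather than return a boolean.

```python
# Each requesting mode maps to the set of held modes it is compatible with.
ACCESS = {
    "read": {"read"},   # read : [read]
    "write": set(),     # write : []
}


def can_grant(access, requested, held_modes):
    """Return True if `requested` is compatible with every mode held by other actions."""
    return all(held in access[requested] for held in held_modes)
```

For instance, can_grant(ACCESS, "read", ["read", "read"]) holds, while a write request is refused whenever any other action holds a mode on the object.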

Locks  can  be  declared  and  then  manipulated  via  the  system  primitives  discussed  in  Section  3.3.  The 
format  of  the  declaration  follows. 

lockvariable : lock (<compatibility_1>; ...; <compatibility_m>) domain = instanceidtype

The  order  specification  is  used  to  state  specific  orderings  among  the  operations  which  must  be 
enforced.  Since  this  is  similar  to  path  expressions,  only  an  example  will  be  given  here.  Consider  a 
spool  queue  with  the  operations  of  enter  and  remove.  Assume  there  can  be  only  n  entries  in  the  queue 
maximum  and  that  we  desire  remove  operations  to  wait  if  no  enter  operation  has  been  committed 
relative  to  the  action  invoking  the  remove  operation.  The  specification  might  be  given  as  follows: 

order = n : (enter; remove)

This specification enforces that at least one enter is committed before a remove operation is allowed
and that at most n enter operations can be performed before a remove is committed.
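
The two constraints can be illustrated with a small Python model that keeps counters instead of action-based semaphores. This is only a sketch under assumptions: the class SpoolOrder and its method names are hypothetical, and the begin_ methods return False where a real implementation would make the invoking action wait.

```python
class SpoolOrder:
    """Hypothetical counter-based check of the spool queue ordering rule."""

    def __init__(self, n):
        self.n = n
        self.enters_started = 0
        self.enters_committed = 0
        self.removes_started = 0
        self.removes_committed = 0

    def begin_enter(self):
        # At most n enters may run ahead of the committed removes.
        if self.enters_started - self.removes_committed >= self.n:
            return False
        self.enters_started += 1
        return True

    def commit_enter(self):
        self.enters_committed += 1

    def begin_remove(self):
        # Every remove needs a committed enter of its own.
        if self.removes_started >= self.enters_committed:
            return False
        self.removes_started += 1
        return True

    def commit_remove(self):
        self.removes_committed += 1
```

With n = 2, a third enter is refused until some remove has committed, and a remove is refused until some enter has committed.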

4.2 Recovery Facilities

The recovery facilities are again very similar to the corresponding system primitives. However,
specifying the objectid is not required (supplied by the runtime system) and each log record is typed
by a variable placed first in the log record which contains the name of the operation which performed
the writelog. The format then is

save (<var1>, <var2>, ..., <varn>) {corresponding to writelog}

restore (<logrectype>, <var1>, <var2>, ..., <varn>) {corresponding to readlog}


5.  A  Directory  Example 


For convenience in this example we will use the language notations presented in Section 4. The
purpose of the example, however, is not to defend particular language constructs, but rather to
illustrate the use of the atomicity facilities. In the following we analyze how a directory object type
might be defined using the facilities presented in this paper. We show two possible approaches using
different levels of sophistication to achieve different amounts of concurrency. The first approach
requires only an access statement to control action interleavings. The second requires an
alternative specification and minor programming, but achieves higher concurrency. A more formal
treatment of this design process is presented in [Allc83].

Suppose  we  wish  to  create  an  action-based  directory  object  (such  as  the  one  shown  in  Figure  4  with 
Add  and  Delete  substituted  accordingly)  with  the  following  operations: 


Add(k : key; v : value) : status
Delete(k : key) : status
Lookup(k : key) : status, value
ListSerial(currentkey : key) : status, value, nextkey
ListApprox(currentkey : key) : status, value, nextkey

Let us assume that all operations other than ListApprox require a serially consistent view of the
directory, but that ListApprox has semantics such that the invoking action will accept seeing
(possibly) uncommitted changes. One possible access specification is given below.

access = ( Add        : [ListApprox];
           Delete     : [ListApprox];
           Lookup     : [Lookup, ListSerial, ListApprox];
           ListSerial : [Lookup, ListSerial, ListApprox];
           ListApprox : [Add, Delete, Lookup, ListSerial, ListApprox])

This  specification,  although  correct,  may  not  achieve  an  acceptable  level  of  concurrency.  Even 
though  two  actions  could  correctly  operate  on  different  keys  in  the  directory,  this  is  not  allowed  by  the 
specification. 

To improve concurrency a different access specification could be used together with programming to
specifically control directory entry sharing. In the alternative specification below we only
synchronize the operation ListSerial; the other operations can be synchronized via the built-in lock
facility.

access = ( Add        : [Add, Delete, Lookup, ListApprox];
           Delete     : [Add, Delete, Lookup, ListApprox];
           Lookup     : [Add, Delete, Lookup, ListSerial, ListApprox];
           ListSerial : [Lookup, ListSerial, ListApprox];
           ListApprox : [Add, Delete, Lookup, ListSerial, ListApprox])
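
To see what the alternative specification buys, the two access specifications can be written as tables and compared. The Python sketch below lists the (requesting, held) operation pairs that the second specification permits but the first does not. The names COARSE, FINE, and newly_allowed are hypothetical and used only for this illustration.

```python
ALL = {"Add", "Delete", "Lookup", "ListSerial", "ListApprox"}

# First specification: updates conflict with everything except ListApprox.
COARSE = {
    "Add": {"ListApprox"},
    "Delete": {"ListApprox"},
    "Lookup": {"Lookup", "ListSerial", "ListApprox"},
    "ListSerial": {"Lookup", "ListSerial", "ListApprox"},
    "ListApprox": set(ALL),
}

# Alternative specification: only ListSerial is synchronized at the directory level.
FINE = {
    "Add": {"Add", "Delete", "Lookup", "ListApprox"},
    "Delete": {"Add", "Delete", "Lookup", "ListApprox"},
    "Lookup": set(ALL),
    "ListSerial": {"Lookup", "ListSerial", "ListApprox"},
    "ListApprox": set(ALL),
}


def newly_allowed(coarse, fine):
    """Return the sorted (requesting, held) pairs allowed by `fine` but not by `coarse`."""
    return sorted(
        (req, held)
        for req in fine
        for held in fine[req] - coarse[req]
    )
```

Every pair returned involves Add, Delete, or Lookup. These are exactly the interleavings that must now be controlled per directory entry by explicit locking.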

We  declare  a  lock  as  follows: 

x : lock (read : [read]; change : []) domain = key

We  can  then  use  setlock  and  releaselock  to  dynamically  control  action  synchronization  on  the 
directory  entries.  Using  this  approach  the  Add  operation  might  appear  as  follows: 

action Add (k : key; v : value)
begin
    setlock (x, k, change, timelimit);
    ... {put entry into the directory}
    save (k, v) {implemented through writelog}
end;
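
A lock whose domain is the key type can be pictured as a table of independent lock instances indexed by key value. The following Python sketch is a simplified model under assumptions: KeyLock is a hypothetical name, setlock takes the requesting action explicitly, and it returns False where the real primitive would block until timelimit expires.

```python
class KeyLock:
    """Hypothetical lock with one independent instance per key in its domain."""

    COMPAT = {"read": {"read"}, "change": set()}  # read : [read]; change : []

    def __init__(self):
        self._held = {}                           # key -> list of (action, mode)

    def setlock(self, action, key, mode):
        # The key need not name an existing directory entry.
        holders = self._held.setdefault(key, [])
        for other, held_mode in holders:
            if other != action and held_mode not in self.COMPAT[mode]:
                return False                      # conflict with another action
        holders.append((action, mode))
        return True

    def releaseall(self, action):
        # Called at action termination, in keeping with two-phase locking.
        for key in list(self._held):
            self._held[key] = [h for h in self._held[key] if h[0] != action]
```

Two actions that change different keys never conflict in this model, and a key can be locked before any entry for it exists.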

Note  that  we  may  lock  instances  which  do  not  exist,  thus  avoiding  the  phantom  problem  and 
preventing  loss  of  serial  consistency.  The  choice  of  synchronizing  ListSerial  at  the  directory  level 
instead  of  the  key  entry  level  avoids  the  overhead  of  acquiring  and  releasing  a  lock  for  each  key. 
Instead  only  one  lock  must  be  accessed.  Thus,  tradeoffs  involving  granule  size  are  possible.  The 
Abort  event  code  might  appear  as  follows  in  the  directory  object. 

entry procedure Abort;
var
    k : key; v : value;
begin
    restore (logrectype, k, v, stat);
    while (stat <> endoflog) do
    begin
        case logrectype of
            add    : {remove entry from directory}
            delete : {put entry back into directory}
        end;
        restore (logrectype, k, v, stat)
    end;
    {locks are automatically released by the runtime system using releaseall}
end.
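
The undo loop above can be mirrored in a small runnable model. The Python sketch below is an illustration under assumptions rather than the Clouds directory: the class Directory is hypothetical, add is assumed to be used only for keys not already present, and log records are assumed to be read back newest first.

```python
class Directory:
    """Hypothetical action-based directory with an undo log."""

    def __init__(self):
        self.entries = {}
        self._log = []                           # records: (logrectype, key, value)

    def add(self, k, v):
        self.entries[k] = v                      # assumes k is a new key
        self._log.append(("add", k, v))          # save (k, v), typed by the operation

    def delete(self, k):
        v = self.entries.pop(k)
        self._log.append(("delete", k, v))       # keep the old value for undo

    def abort(self):
        while self._log:                         # loop until end of log
            logrectype, k, v = self._log.pop()   # newest record first (assumed)
            if logrectype == "add":
                del self.entries[k]              # remove entry from directory
            elif logrectype == "delete":
                self.entries[k] = v              # put entry back into directory

    def commit(self):
        self._log.clear()                        # changes become permanent
```

Aborting after an add and a delete restores the directory to its state at the last commit.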


6. A Cooperating Process Example


There are many interactions among processes which do not necessarily involve cooperation of the
processes to complete the actions the processes are performing. Consider message queues where the
receiving process is not allowed to examine messages until the sender process commits the group of
messages sent. A typical example is a spooling system where the spooler process does not see
producing processes' output until it is committed. In a sense the output is cached until the commit
occurs. This situation appears quite common and can easily be supported by the atomicity facilities
presented. The language structures associated with the Clouds system also model this paradigm.

There are, however, examples where multiple processes must interact to complete an action. We now
illustrate how the atomicity facilities discussed in this paper might be applied to solve this class of
problems. Presented below is a sketch of two processes which cooperatively perform an action. As
discussed before, commitment is possible only if both processes request commitment. If one of the
processes aborts, the other one can detect this through the notify system primitive and can then abort
the action also.


Process1:
    create action
    send actionid
    link to action
    repeat until done
        check action status
        if action aborted, then abort
        do work
        write log records as necessary
        if error abort
        send message
    end
    commit

Process2:
    receive actionid
    link to action
    repeat until done
        check action status
        if action aborted, then abort
        receive message
        do work
        write log records as necessary
        if error abort
    end
    commit


Process1 is responsible for initially creating the action; it then communicates the actionid to Process2.
Both processes then link to the action and begin processing. Both processes must occasionally check
the state of the action and abort if the other process has already aborted the action. Only if both reach
commit will the action actually be committed.
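
The agreement rule (commit only when every linked process has requested commitment, abort as soon as any one aborts) can be sketched in Python as follows. SharedAction and its method names are hypothetical, and the sketch ignores the kernel interface, logging, and failure detection.

```python
class SharedAction:
    """Hypothetical record of one action performed by several linked processes."""

    def __init__(self):
        self.linked = set()
        self.votes = set()
        self.status = "active"
        self.aborted_by = None

    def link(self, process):
        self.linked.add(process)

    def abort(self, process):
        # A single abort is enough to abort the whole action.
        if self.status == "active":
            self.status = "aborted"
            self.aborted_by = process

    def request_commit(self, process):
        if self.status != "active":
            return self.status                  # lets the caller see an abort
        self.votes.add(process)
        if self.votes == self.linked:           # every linked process has agreed
            self.status = "committed"
        return self.status
```

With two linked processes, the first commit request leaves the action active and the second commits it. If either process aborts first, a later commit request simply reports the aborted status.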


The processes can define the objects in any manner that is convenient since the primitives primarily
use the objectids simply as a way to structure the logs. If the two processes manipulate the same
object, the first one to issue touchobject is the process responsible for performing the action event
procedures (e.g., EOA). As desired, both processes, because they are performing the same action, will
write to the same log. Of course, if the processes could enter a shared object concurrently, standard
process synchronization must be used. Since they are performing the same action, interference
cannot be prevented by action synchronization.


7.  Summary 


This  paper  has  explored  the  issues  involved  with  integrating  facilities  to  support  atomicity  into  the 
kernel of an operating system. For generality these facilities should not bind actions and processes
tightly, permitting either a single process or multiple processes to perform an action. It has been
suggested  that  a  more  general  type  of  atomicity,  semantic  atomicity,  is  desirable  for  efficiency  in  some 
cases.  It  has  been  proposed  that  system  designers  and  programmers  be  given  direct  control  over 
accomplishing  atomicity  (both  concurrency  and  failure).  We  have  presented  a  set  of  requirements  for 
supporting  these  kinds  of  tools.  These  requirements  include  the  ability  to  create  and  terminate 
actions,  to  control  concurrency  between  actions,  to  recover  from  action  failures,  to  perform  special 
processing  on  transitions  in  the  state  of  actions,  and  to  incorporate  process  agreement  facilities  which 
allow  processes  performing  an  action  to  reach  a  consensus  concerning  action  state  transitions.  A  set 
of  kernel  primitives  for  atomicity  was  presented  within  a  framework  of  processes,  actions,  and  objects. 
The  generality  for  message-oriented  systems  was  also  discussed.  The  mechanisms  appear  especially 
important in distributed environments. A distributed operating system environment was used to
demonstrate one possible approach for actual integration of the primitives. Finally, two examples
were  presented:  a  directory  system  (using  some  semantic  knowledge  concerning  the  actions  operating 
on  the  directory)  and  two  processes  cooperatively  performing  an  action. 

Our work has addressed a fundamental problem confronting operating systems, particularly
distributed ones. It appears that a significant advance in reliability and system organization might
be possible with a well engineered set of orthogonal mechanisms to address atomicity.